Initial import

2026-03-06 09:15:55 +01:00 · 2020-03-31 17:15:23 +02:00
commit d97ea9b0bc
311 changed files with 131419 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
 __pycache__
--- a/674
+++ b/674
@@ -0,0 +1,674 @@
                    GNU GENERAL PUBLIC LICENSE
                       Version 3, 29 June 2007
 Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.
                            Preamble
  The GNU General Public License is a free, copyleft license for
 software and other kinds of works.
  The licenses for most software and other practical works are designed
 to take away your freedom to share and change the works.  By contrast,
 the GNU General Public License is intended to guarantee your freedom to
 share and change all versions of a program--to make sure it remains free
 software for all its users.  We, the Free Software Foundation, use the
 GNU General Public License for most of our software; it applies also to
 any other work released this way by its authors.  You can apply it to
 your programs, too.
  When we speak of free software, we are referring to freedom, not
 price.  Our General Public Licenses are designed to make sure that you
 have the freedom to distribute copies of free software (and charge for
 them if you wish), that you receive source code or can get it if you
 want it, that you can change the software or use pieces of it in new
 free programs, and that you know you can do these things.
  To protect your rights, we need to prevent others from denying you
 these rights or asking you to surrender the rights.  Therefore, you have
 certain responsibilities if you distribute copies of the software, or if
 you modify it: responsibilities to respect the freedom of others.
  For example, if you distribute copies of such a program, whether
 gratis or for a fee, you must pass on to the recipients the same
 freedoms that you received.  You must make sure that they, too, receive
 or can get the source code.  And you must show them these terms so they
 know their rights.
  Developers that use the GNU GPL protect your rights with two steps:
 (1) assert copyright on the software, and (2) offer you this License
 giving you legal permission to copy, distribute and/or modify it.
  For the developers' and authors' protection, the GPL clearly explains
 that there is no warranty for this free software.  For both users' and
 authors' sake, the GPL requires that modified versions be marked as
 changed, so that their problems will not be attributed erroneously to
 authors of previous versions.
  Some devices are designed to deny users access to install or run
 modified versions of the software inside them, although the manufacturer
 can do so.  This is fundamentally incompatible with the aim of
 protecting users' freedom to change the software.  The systematic
 pattern of such abuse occurs in the area of products for individuals to
 use, which is precisely where it is most unacceptable.  Therefore, we
 have designed this version of the GPL to prohibit the practice for those
 products.  If such problems arise substantially in other domains, we
 stand ready to extend this provision to those domains in future versions
 of the GPL, as needed to protect the freedom of users.
  Finally, every program is threatened constantly by software patents.
 States should not allow patents to restrict development and use of
 software on general-purpose computers, but in those that do, we wish to
 avoid the special danger that patents applied to a free program could
 make it effectively proprietary.  To prevent this, the GPL assures that
 patents cannot be used to render the program non-free.
  The precise terms and conditions for copying, distribution and
 modification follow.
                       TERMS AND CONDITIONS
  0. Definitions.
  "This License" refers to version 3 of the GNU General Public License.
  "Copyright" also means copyright-like laws that apply to other kinds of
 works, such as semiconductor masks.
  "The Program" refers to any copyrightable work licensed under this
 License.  Each licensee is addressed as "you".  "Licensees" and
 "recipients" may be individuals or organizations.
  To "modify" a work means to copy from or adapt all or part of the work
 in a fashion requiring copyright permission, other than the making of an
 exact copy.  The resulting work is called a "modified version" of the
 earlier work or a work "based on" the earlier work.
  A "covered work" means either the unmodified Program or a work based
 on the Program.
  To "propagate" a work means to do anything with it that, without
 permission, would make you directly or secondarily liable for
 infringement under applicable copyright law, except executing it on a
 computer or modifying a private copy.  Propagation includes copying,
 distribution (with or without modification), making available to the
 public, and in some countries other activities as well.
  To "convey" a work means any kind of propagation that enables other
 parties to make or receive copies.  Mere interaction with a user through
 a computer network, with no transfer of a copy, is not conveying.
  An interactive user interface displays "Appropriate Legal Notices"
 to the extent that it includes a convenient and prominently visible
 feature that (1) displays an appropriate copyright notice, and (2)
 tells the user that there is no warranty for the work (except to the
 extent that warranties are provided), that licensees may convey the
 work under this License, and how to view a copy of this License.  If
 the interface presents a list of user commands or options, such as a
 menu, a prominent item in the list meets this criterion.
  1. Source Code.
  The "source code" for a work means the preferred form of the work
 for making modifications to it.  "Object code" means any non-source
 form of a work.
  A "Standard Interface" means an interface that either is an official
 standard defined by a recognized standards body, or, in the case of
 interfaces specified for a particular programming language, one that
 is widely used among developers working in that language.
  The "System Libraries" of an executable work include anything, other
 than the work as a whole, that (a) is included in the normal form of
 packaging a Major Component, but which is not part of that Major
 Component, and (b) serves only to enable use of the work with that
 Major Component, or to implement a Standard Interface for which an
 implementation is available to the public in source code form.  A
 "Major Component", in this context, means a major essential component
 (kernel, window system, and so on) of the specific operating system
 (if any) on which the executable work runs, or a compiler used to
 produce the work, or an object code interpreter used to run it.
  The "Corresponding Source" for a work in object code form means all
 the source code needed to generate, install, and (for an executable
 work) run the object code and to modify the work, including scripts to
 control those activities.  However, it does not include the work's
 System Libraries, or general-purpose tools or generally available free
 programs which are used unmodified in performing those activities but
 which are not part of the work.  For example, Corresponding Source
 includes interface definition files associated with source files for
 the work, and the source code for shared libraries and dynamically
 linked subprograms that the work is specifically designed to require,
 such as by intimate data communication or control flow between those
 subprograms and other parts of the work.
  The Corresponding Source need not include anything that users
 can regenerate automatically from other parts of the Corresponding
 Source.
  The Corresponding Source for a work in source code form is that
 same work.
  2. Basic Permissions.
  All rights granted under this License are granted for the term of
 copyright on the Program, and are irrevocable provided the stated
 conditions are met.  This License explicitly affirms your unlimited
 permission to run the unmodified Program.  The output from running a
 covered work is covered by this License only if the output, given its
 content, constitutes a covered work.  This License acknowledges your
 rights of fair use or other equivalent, as provided by copyright law.
  You may make, run and propagate covered works that you do not
 convey, without conditions so long as your license otherwise remains
 in force.  You may convey covered works to others for the sole purpose
 of having them make modifications exclusively for you, or provide you
 with facilities for running those works, provided that you comply with
 the terms of this License in conveying all material for which you do
 not control copyright.  Those thus making or running the covered works
 for you must do so exclusively on your behalf, under your direction
 and control, on terms that prohibit them from making any copies of
 your copyrighted material outside their relationship with you.
  Conveying under any other circumstances is permitted solely under
 the conditions stated below.  Sublicensing is not allowed; section 10
 makes it unnecessary.
  3. Protecting Users' Legal Rights From Anti-Circumvention Law.
  No covered work shall be deemed part of an effective technological
 measure under any applicable law fulfilling obligations under article
 11 of the WIPO copyright treaty adopted on 20 December 1996, or
 similar laws prohibiting or restricting circumvention of such
 measures.
  When you convey a covered work, you waive any legal power to forbid
 circumvention of technological measures to the extent such circumvention
 is effected by exercising rights under this License with respect to
 the covered work, and you disclaim any intention to limit operation or
 modification of the work as a means of enforcing, against the work's
 users, your or third parties' legal rights to forbid circumvention of
 technological measures.
  4. Conveying Verbatim Copies.
  You may convey verbatim copies of the Program's source code as you
 receive it, in any medium, provided that you conspicuously and
 appropriately publish on each copy an appropriate copyright notice;
 keep intact all notices stating that this License and any
 non-permissive terms added in accord with section 7 apply to the code;
 keep intact all notices of the absence of any warranty; and give all
 recipients a copy of this License along with the Program.
  You may charge any price or no price for each copy that you convey,
 and you may offer support or warranty protection for a fee.
  5. Conveying Modified Source Versions.
  You may convey a work based on the Program, or the modifications to
 produce it from the Program, in the form of source code under the
 terms of section 4, provided that you also meet all of these conditions:
    a) The work must carry prominent notices stating that you modified
    it, and giving a relevant date.
    b) The work must carry prominent notices stating that it is
    released under this License and any conditions added under section
    7.  This requirement modifies the requirement in section 4 to
    "keep intact all notices".
    c) You must license the entire work, as a whole, under this
    License to anyone who comes into possession of a copy.  This
    License will therefore apply, along with any applicable section 7
    additional terms, to the whole of the work, and all its parts,
    regardless of how they are packaged.  This License gives no
    permission to license the work in any other way, but it does not
    invalidate such permission if you have separately received it.
    d) If the work has interactive user interfaces, each must display
    Appropriate Legal Notices; however, if the Program has interactive
    interfaces that do not display Appropriate Legal Notices, your
    work need not make them do so.
  A compilation of a covered work with other separate and independent
 works, which are not by their nature extensions of the covered work,
 and which are not combined with it such as to form a larger program,
 in or on a volume of a storage or distribution medium, is called an
 "aggregate" if the compilation and its resulting copyright are not
 used to limit the access or legal rights of the compilation's users
 beyond what the individual works permit.  Inclusion of a covered work
 in an aggregate does not cause this License to apply to the other
 parts of the aggregate.
  6. Conveying Non-Source Forms.
  You may convey a covered work in object code form under the terms
 of sections 4 and 5, provided that you also convey the
 machine-readable Corresponding Source under the terms of this License,
 in one of these ways:
    a) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by the
    Corresponding Source fixed on a durable physical medium
    customarily used for software interchange.
    b) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by a
    written offer, valid for at least three years and valid for as
    long as you offer spare parts or customer support for that product
    model, to give anyone who possesses the object code either (1) a
    copy of the Corresponding Source for all the software in the
    product that is covered by this License, on a durable physical
    medium customarily used for software interchange, for a price no
    more than your reasonable cost of physically performing this
    conveying of source, or (2) access to copy the
    Corresponding Source from a network server at no charge.
    c) Convey individual copies of the object code with a copy of the
    written offer to provide the Corresponding Source.  This
    alternative is allowed only occasionally and noncommercially, and
    only if you received the object code with such an offer, in accord
    with subsection 6b.
    d) Convey the object code by offering access from a designated
    place (gratis or for a charge), and offer equivalent access to the
    Corresponding Source in the same way through the same place at no
    further charge.  You need not require recipients to copy the
    Corresponding Source along with the object code.  If the place to
    copy the object code is a network server, the Corresponding Source
    may be on a different server (operated by you or a third party)
    that supports equivalent copying facilities, provided you maintain
    clear directions next to the object code saying where to find the
    Corresponding Source.  Regardless of what server hosts the
    Corresponding Source, you remain obligated to ensure that it is
    available for as long as needed to satisfy these requirements.
    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.
  A separable portion of the object code, whose source code is excluded
 from the Corresponding Source as a System Library, need not be
 included in conveying the object code work.
  A "User Product" is either (1) a "consumer product", which means any
 tangible personal property which is normally used for personal, family,
 or household purposes, or (2) anything designed or sold for incorporation
 into a dwelling.  In determining whether a product is a consumer product,
 doubtful cases shall be resolved in favor of coverage.  For a particular
 product received by a particular user, "normally used" refers to a
 typical or common use of that class of product, regardless of the status
 of the particular user or of the way in which the particular user
 actually uses, or expects or is expected to use, the product.  A product
 is a consumer product regardless of whether the product has substantial
 commercial, industrial or non-consumer uses, unless such uses represent
 the only significant mode of use of the product.
  "Installation Information" for a User Product means any methods,
 procedures, authorization keys, or other information required to install
 and execute modified versions of a covered work in that User Product from
 a modified version of its Corresponding Source.  The information must
 suffice to ensure that the continued functioning of the modified object
 code is in no case prevented or interfered with solely because
 modification has been made.
  If you convey an object code work under this section in, or with, or
 specifically for use in, a User Product, and the conveying occurs as
 part of a transaction in which the right of possession and use of the
 User Product is transferred to the recipient in perpetuity or for a
 fixed term (regardless of how the transaction is characterized), the
 Corresponding Source conveyed under this section must be accompanied
 by the Installation Information.  But this requirement does not apply
 if neither you nor any third party retains the ability to install
 modified object code on the User Product (for example, the work has
 been installed in ROM).
  The requirement to provide Installation Information does not include a
 requirement to continue to provide support service, warranty, or updates
 for a work that has been modified or installed by the recipient, or for
 the User Product in which it has been modified or installed.  Access to a
 network may be denied when the modification itself materially and
 adversely affects the operation of the network or violates the rules and
 protocols for communication across the network.
  Corresponding Source conveyed, and Installation Information provided,
 in accord with this section must be in a format that is publicly
 documented (and with an implementation available to the public in
 source code form), and must require no special password or key for
 unpacking, reading or copying.
  7. Additional Terms.
  "Additional permissions" are terms that supplement the terms of this
 License by making exceptions from one or more of its conditions.
 Additional permissions that are applicable to the entire Program shall
 be treated as though they were included in this License, to the extent
 that they are valid under applicable law.  If additional permissions
 apply only to part of the Program, that part may be used separately
 under those permissions, but the entire Program remains governed by
 this License without regard to the additional permissions.
  When you convey a copy of a covered work, you may at your option
 remove any additional permissions from that copy, or from any part of
 it.  (Additional permissions may be written to require their own
 removal in certain cases when you modify the work.)  You may place
 additional permissions on material, added by you to a covered work,
 for which you have or can give appropriate copyright permission.
  Notwithstanding any other provision of this License, for material you
 add to a covered work, you may (if authorized by the copyright holders of
 that material) supplement the terms of this License with terms:
    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or
    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or
    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or
    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or
    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or
    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.
  All other non-permissive additional terms are considered "further
 restrictions" within the meaning of section 10.  If the Program as you
 received it, or any part of it, contains a notice stating that it is
 governed by this License along with a term that is a further
 restriction, you may remove that term.  If a license document contains
 a further restriction but permits relicensing or conveying under this
 License, you may add to a covered work material governed by the terms
 of that license document, provided that the further restriction does
 not survive such relicensing or conveying.
  If you add terms to a covered work in accord with this section, you
 must place, in the relevant source files, a statement of the
 additional terms that apply to those files, or a notice indicating
 where to find the applicable terms.
  Additional terms, permissive or non-permissive, may be stated in the
 form of a separately written license, or stated as exceptions;
 the above requirements apply either way.
  8. Termination.
  You may not propagate or modify a covered work except as expressly
 provided under this License.  Any attempt otherwise to propagate or
 modify it is void, and will automatically terminate your rights under
 this License (including any patent licenses granted under the third
 paragraph of section 11).
  However, if you cease all violation of this License, then your
 license from a particular copyright holder is reinstated (a)
 provisionally, unless and until the copyright holder explicitly and
 finally terminates your license, and (b) permanently, if the copyright
 holder fails to notify you of the violation by some reasonable means
 prior to 60 days after the cessation.
  Moreover, your license from a particular copyright holder is
 reinstated permanently if the copyright holder notifies you of the
 violation by some reasonable means, this is the first time you have
 received notice of violation of this License (for any work) from that
 copyright holder, and you cure the violation prior to 30 days after
 your receipt of the notice.
  Termination of your rights under this section does not terminate the
 licenses of parties who have received copies or rights from you under
 this License.  If your rights have been terminated and not permanently
 reinstated, you do not qualify to receive new licenses for the same
 material under section 10.
  9. Acceptance Not Required for Having Copies.
  You are not required to accept this License in order to receive or
 run a copy of the Program.  Ancillary propagation of a covered work
 occurring solely as a consequence of using peer-to-peer transmission
 to receive a copy likewise does not require acceptance.  However,
 nothing other than this License grants you permission to propagate or
 modify any covered work.  These actions infringe copyright if you do
 not accept this License.  Therefore, by modifying or propagating a
 covered work, you indicate your acceptance of this License to do so.
  10. Automatic Licensing of Downstream Recipients.
  Each time you convey a covered work, the recipient automatically
 receives a license from the original licensors, to run, modify and
 propagate that work, subject to this License.  You are not responsible
 for enforcing compliance by third parties with this License.
  An "entity transaction" is a transaction transferring control of an
 organization, or substantially all assets of one, or subdividing an
 organization, or merging organizations.  If propagation of a covered
 work results from an entity transaction, each party to that
 transaction who receives a copy of the work also receives whatever
 licenses to the work the party's predecessor in interest had or could
 give under the previous paragraph, plus a right to possession of the
 Corresponding Source of the work from the predecessor in interest, if
 the predecessor has it or can get it with reasonable efforts.
  You may not impose any further restrictions on the exercise of the
 rights granted or affirmed under this License.  For example, you may
 not impose a license fee, royalty, or other charge for exercise of
 rights granted under this License, and you may not initiate litigation
 (including a cross-claim or counterclaim in a lawsuit) alleging that
 any patent claim is infringed by making, using, selling, offering for
 sale, or importing the Program or any portion of it.
  11. Patents.
  A "contributor" is a copyright holder who authorizes use under this
 License of the Program or a work on which the Program is based.  The
 work thus licensed is called the contributor's "contributor version".
  A contributor's "essential patent claims" are all patent claims
 owned or controlled by the contributor, whether already acquired or
 hereafter acquired, that would be infringed by some manner, permitted
 by this License, of making, using, or selling its contributor version,
 but do not include claims that would be infringed only as a
 consequence of further modification of the contributor version.  For
 purposes of this definition, "control" includes the right to grant
 patent sublicenses in a manner consistent with the requirements of
 this License.
  Each contributor grants you a non-exclusive, worldwide, royalty-free
 patent license under the contributor's essential patent claims, to
 make, use, sell, offer for sale, import and otherwise run, modify and
 propagate the contents of its contributor version.
  In the following three paragraphs, a "patent license" is any express
 agreement or commitment, however denominated, not to enforce a patent
 (such as an express permission to practice a patent or covenant not to
 sue for patent infringement).  To "grant" such a patent license to a
 party means to make such an agreement or commitment not to enforce a
 patent against the party.
  If you convey a covered work, knowingly relying on a patent license,
 and the Corresponding Source of the work is not available for anyone
 to copy, free of charge and under the terms of this License, through a
 publicly available network server or other readily accessible means,
 then you must either (1) cause the Corresponding Source to be so
 available, or (2) arrange to deprive yourself of the benefit of the
 patent license for this particular work, or (3) arrange, in a manner
 consistent with the requirements of this License, to extend the patent
 license to downstream recipients.  "Knowingly relying" means you have
 actual knowledge that, but for the patent license, your conveying the
 covered work in a country, or your recipient's use of the covered work
 in a country, would infringe one or more identifiable patents in that
 country that you have reason to believe are valid.
  If, pursuant to or in connection with a single transaction or
 arrangement, you convey, or propagate by procuring conveyance of, a
 covered work, and grant a patent license to some of the parties
 receiving the covered work authorizing them to use, propagate, modify
 or convey a specific copy of the covered work, then the patent license
 you grant is automatically extended to all recipients of the covered
 work and works based on it.
  A patent license is "discriminatory" if it does not include within
 the scope of its coverage, prohibits the exercise of, or is
 conditioned on the non-exercise of one or more of the rights that are
 specifically granted under this License.  You may not convey a covered
 work if you are a party to an arrangement with a third party that is
 in the business of distributing software, under which you make payment
 to the third party based on the extent of your activity of conveying
 the work, and under which the third party grants, to any of the
 parties who would receive the covered work from you, a discriminatory
 patent license (a) in connection with copies of the covered work
 conveyed by you (or copies made from those copies), or (b) primarily
 for and in connection with specific products or compilations that
 contain the covered work, unless you entered into that arrangement,
 or that patent license was granted, prior to 28 March 2007.
  Nothing in this License shall be construed as excluding or limiting
 any implied license or other defenses to infringement that may
 otherwise be available to you under applicable patent law.
  12. No Surrender of Others' Freedom.
  If conditions are imposed on you (whether by court order, agreement or
 otherwise) that contradict the conditions of this License, they do not
 excuse you from the conditions of this License.  If you cannot convey a
 covered work so as to satisfy simultaneously your obligations under this
 License and any other pertinent obligations, then as a consequence you may
 not convey it at all.  For example, if you agree to terms that obligate you
 to collect a royalty for further conveying from those to whom you convey
 the Program, the only way you could satisfy both those terms and this
 License would be to refrain entirely from conveying the Program.
  13. Use with the GNU Affero General Public License.
  Notwithstanding any other provision of this License, you have
 permission to link or combine any covered work with a work licensed
 under version 3 of the GNU Affero General Public License into a single
 combined work, and to convey the resulting work.  The terms of this
 License will continue to apply to the part which is the covered work,
 but the special requirements of the GNU Affero General Public License,
 section 13, concerning interaction through a network will apply to the
 combination as such.
  14. Revised Versions of this License.
  The Free Software Foundation may publish revised and/or new versions of
 the GNU General Public License from time to time.  Such new versions will
 be similar in spirit to the present version, but may differ in detail to
 address new problems or concerns.
  Each version is given a distinguishing version number.  If the
 Program specifies that a certain numbered version of the GNU General
 Public License "or any later version" applies to it, you have the
 option of following the terms and conditions either of that numbered
 version or of any later version published by the Free Software
 Foundation.  If the Program does not specify a version number of the
 GNU General Public License, you may choose any version ever published
 by the Free Software Foundation.
  If the Program specifies that a proxy can decide which future
 versions of the GNU General Public License can be used, that proxy's
 public statement of acceptance of a version permanently authorizes you
 to choose that version for the Program.
  Later license versions may give you additional or different
 permissions.  However, no additional obligations are imposed on any
 author or copyright holder as a result of your choosing to follow a
 later version.
  15. Disclaimer of Warranty.
  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
 APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
 HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
 OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
 THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
 IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
 ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
  16. Limitation of Liability.
  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
 WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
 THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
 GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
 USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
 DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
 PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
 EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
 SUCH DAMAGES.
  17. Interpretation of Sections 15 and 16.
  If the disclaimer of warranty and limitation of liability provided
 above cannot be given local legal effect according to their terms,
 reviewing courts shall apply local law that most closely approximates
 an absolute waiver of all civil liability in connection with the
 Program, unless a warranty or assumption of liability accompanies a
 copy of the Program in return for a fee.
                     END OF TERMS AND CONDITIONS
            How to Apply These Terms to Your New Programs
  If you develop a new program, and you want it to be of the greatest
 possible use to the public, the best way to achieve this is to make it
 free software which everyone can redistribute and change under these terms.
  To do so, attach the following notices to the program.  It is safest
 to attach them to the start of each source file to most effectively
 state the exclusion of warranty; and each file should have at least
 the "copyright" line and a pointer to where the full notice is found.
    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.
    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.
 Also add information on how to contact you by electronic and paper mail.
  If the program does terminal interaction, make it output a short
 notice like this when it starts in an interactive mode:
    <program>  Copyright (C) <year>  <name of author>
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.
 The hypothetical commands `show w' and `show c' should show the appropriate
 parts of the General Public License.  Of course, your program's commands
 might be different; for a GUI interface, you would use an "about box".
  You should also get your employer (if you work as a programmer) or school,
 if any, to sign a "copyright disclaimer" for the program, if necessary.
 For more information on this, and how to apply and follow the GNU GPL, see
 <https://www.gnu.org/licenses/>.
  The GNU General Public License does not permit incorporating your program
 into proprietary programs.  If your program is a subroutine library, you
 may consider it more useful to permit linking proprietary applications with
 the library.  If this is what you want to do, use the GNU Lesser General
 Public License instead of this License.  But first, please read
 <https://www.gnu.org/philosophy/why-not-lgpl.html>.
--- a/README.rst
+++ b/README.rst
@@ -0,0 +1,26 @@
 ===============
 Ebook converter
 ===============
 This is an impudent ripoff of bits from the `Calibre project`_, aimed only at
 the converter part.
 My motivation is to have only an ebook converter that runs from the command
 line, without all the bells and whistles Calibre has, and with a cleaner, more
 *pythonic* approach.
 Installation
 ------------
 TBD.
 License
 -------
 This work is licensed under the GPL3 license, like the original work. See the
 LICENSE file for details.
 .. _Calibre project: https://calibre-ebook.com/
--- a/ebook_converter/init.py
+++ b/ebook_converter/init.py
@@ -0,0 +1,681 @@
 from __future__ import unicode_literals, print_function
 ''' E-book management software'''
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import sys, os, re, time, random, warnings
 from polyglot.builtins import codepoint_to_chr, unicode_type, range, hasenv, native_string_type
 from math import floor
 from functools import partial
 if not hasenv('CALIBRE_SHOW_DEPRECATION_WARNINGS'):
    warnings.simplefilter('ignore', DeprecationWarning)
 try:
    os.getcwd()
 except EnvironmentError:
    os.chdir(os.path.expanduser('~'))
 from calibre.constants import (iswindows, isosx, islinux, isfrozen,
        isbsd, preferred_encoding, __appname__, __version__, __author__,
        win32event, win32api, winerror, fcntl, ispy3,
        filesystem_encoding, plugins, config_dir)
 from calibre.startup import winutil, winutilerror
 from calibre.utils.icu import safe_chr
 if False:
    # Prevent pyflakes from complaining
    winutil, winutilerror, __appname__, islinux, __version__
    fcntl, win32event, isfrozen, __author__
    winerror, win32api, isbsd, config_dir
 _mt_inited = False
 def _init_mimetypes():
    global _mt_inited
    import mimetypes
    mimetypes.init([P('mime.types')])
    _mt_inited = True
 def guess_type(*args, **kwargs):
    import mimetypes
    if not _mt_inited:
        _init_mimetypes()
    return mimetypes.guess_type(*args, **kwargs)
 def guess_all_extensions(*args, **kwargs):
    import mimetypes
    if not _mt_inited:
        _init_mimetypes()
    return mimetypes.guess_all_extensions(*args, **kwargs)
 def guess_extension(*args, **kwargs):
    import mimetypes
    if not _mt_inited:
        _init_mimetypes()
    ext = mimetypes.guess_extension(*args, **kwargs)
    if not ext and args and args[0] == 'application/x-palmreader':
        ext = '.pdb'
    return ext
 def get_types_map():
    import mimetypes
    if not _mt_inited:
        _init_mimetypes()
    return mimetypes.types_map
 def to_unicode(raw, encoding='utf-8', errors='strict'):
    if isinstance(raw, unicode_type):
        return raw
    return raw.decode(encoding, errors)
 def patheq(p1, p2):
    p = os.path
    d = lambda x : p.normcase(p.normpath(p.realpath(p.normpath(x))))
    if not p1 or not p2:
        return False
    return d(p1) == d(p2)
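A minimal standalone sketch of the comparison `patheq` performs, assuming Python 3 and only the standard library. The name `same_path` is hypothetical; the normalization chain is the one used in the function above.

```python
import os


def same_path(p1, p2):
    # Empty or None paths never compare equal.
    if not p1 or not p2:
        return False

    def canonical(x):
        p = os.path
        # Collapse '..' and duplicate separators, resolve symlinks,
        # then fold case on case-insensitive platforms.
        return p.normcase(p.normpath(p.realpath(p.normpath(x))))

    return canonical(p1) == canonical(p2)


print(same_path('a/../b', 'b'))   # both resolve to <cwd>/b
print(same_path('', 'b'))         # an empty path is never equal
```

Because both arguments go through the same chain, relative and absolute spellings of one location compare equal, while a missing path short-circuits to `False`.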
 def unicode_path(path, abs=False):
    if isinstance(path, bytes):
        path = path.decode(filesystem_encoding)
    if abs:
        path = os.path.abspath(path)
    return path
 def osx_version():
    if isosx:
        import platform
        src = platform.mac_ver()[0]
        m = re.match(r'(\d+)\.(\d+)\.(\d+)', src)
        if m:
            return int(m.group(1)), int(m.group(2)), int(m.group(3))
 def confirm_config_name(name):
    return name + '_again'
 _filename_sanitize_unicode = frozenset(('\\', '|', '?', '*', '<',        # no2to3
    '"', ':', '>', '+', '/') + tuple(map(codepoint_to_chr, range(32))))  # no2to3
 def sanitize_file_name(name, substitute='_'):
    '''
    Sanitize the filename `name`. All invalid characters are replaced by `substitute`.
    The set of invalid characters is the union of the invalid characters in Windows,
    macOS and Linux. Also removes leading and trailing whitespace.
    **WARNING:** This function also replaces path separators, so only pass file names
    and not full paths to it.
    '''
    if isbytestring(name):
        name = name.decode(filesystem_encoding, 'replace')
    if isbytestring(substitute):
        substitute = substitute.decode(filesystem_encoding, 'replace')
    chars = (substitute if c in _filename_sanitize_unicode else c for c in name)
    one = ''.join(chars)
    one = re.sub(r'\s', ' ', one).strip()
    bname, ext = os.path.splitext(one)
    one = re.sub(r'^\.+$', '_', bname)
    one = one.replace('..', substitute)
    one += ext
    # Windows doesn't like path components that end with a period or space
    if one and one[-1] in ('.', ' '):
        one = one[:-1]+'_'
    # Names starting with a period are hidden on Unix
    if one.startswith('.'):
        one = '_' + one[1:]
    return one
 sanitize_file_name2 = sanitize_file_name_unicode = sanitize_file_name
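The steps of `sanitize_file_name` can be sketched without any calibre imports. This is a simplified standalone version for Python 3 `str` input only; the names `INVALID` and `sanitize_sketch` are hypothetical, and the bytes-decoding branch of the original is left out.

```python
import os
import re

# Characters rejected on at least one of Windows, macOS or Linux,
# plus the 32 ASCII control characters.
INVALID = frozenset('\\|?*<":>+/') | frozenset(map(chr, range(32)))


def sanitize_sketch(name, substitute='_'):
    # Replace every invalid character, then normalize whitespace.
    one = ''.join(substitute if c in INVALID else c for c in name)
    one = re.sub(r'\s', ' ', one).strip()
    bname, ext = os.path.splitext(one)
    # A base name made only of dots becomes a single underscore.
    one = re.sub(r'^\.+$', '_', bname)
    one = one.replace('..', substitute)
    one += ext
    # Windows dislikes names ending in a period or space.
    if one and one[-1] in ('.', ' '):
        one = one[:-1] + '_'
    # A leading period would hide the file on Unix.
    if one.startswith('.'):
        one = '_' + one[1:]
    return one


print(sanitize_sketch('a/b:c?.txt'))   # -> a_b_c_.txt
print(sanitize_sketch('.hidden'))      # -> _hidden
print(sanitize_sketch('report. '))     # -> report_
```

Note that the path separator `/` is replaced as well, which is why only file names, not full paths, should be passed in.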
 def prints(*args, **kwargs):
    '''
    Print unicode arguments safely by encoding them to preferred_encoding.
    Has the same signature as the print function from Python 3, except for the
    additional keyword argument safe_encode, which if set to True will cause the
    function to use repr when encoding fails.
    Returns the number of bytes written.
    '''
    file = kwargs.get('file', sys.stdout)
    file = getattr(file, 'buffer', file)
    enc = 'utf-8' if hasenv('CALIBRE_WORKER') else preferred_encoding
    sep  = kwargs.get('sep', ' ')
    if not isinstance(sep, bytes):
        sep = sep.encode(enc)
    end  = kwargs.get('end', '\n')
    if not isinstance(end, bytes):
        end = end.encode(enc)
    safe_encode = kwargs.get('safe_encode', False)
    count = 0
    for i, arg in enumerate(args):
        if isinstance(arg, unicode_type):
            if iswindows:
                from calibre.utils.terminal import Detect
                cs = Detect(file)
                if cs.is_console:
                    cs.write_unicode_text(arg)
                    count += len(arg)
                    if i != len(args)-1:
                        file.write(sep)
                        count += len(sep)
                    continue
            try:
                arg = arg.encode(enc)
            except UnicodeEncodeError:
                try:
                    arg = arg.encode('utf-8')
                except:
                    if not safe_encode:
                        raise
                    arg = repr(arg)
        if not isinstance(arg, bytes):
            try:
                arg = native_string_type(arg)
            except ValueError:
                arg = unicode_type(arg)
            if isinstance(arg, unicode_type):
                try:
                    arg = arg.encode(enc)
                except UnicodeEncodeError:
                    try:
                        arg = arg.encode('utf-8')
                    except:
                        if not safe_encode:
                            raise
                        arg = repr(arg)
        try:
            file.write(arg)
            count += len(arg)
        except:
            from polyglot import reprlib
            arg = reprlib.repr(arg)
            file.write(arg)
            count += len(arg)
        if i != len(args)-1:
            file.write(sep)
            count += len(sep)
    file.write(end)
    count += len(end)
    return count
 class CommandLineError(Exception):
    pass
 def setup_cli_handlers(logger, level):
    import logging
    if hasenv('CALIBRE_WORKER') and logger.handlers:
        return
    logger.setLevel(level)
    if level == logging.WARNING:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
        handler.setLevel(logging.WARNING)
    elif level == logging.INFO:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter())
        handler.setLevel(logging.INFO)
    elif level == logging.DEBUG:
        handler = logging.StreamHandler(sys.stderr)
        handler.setLevel(logging.DEBUG)
        handler.setFormatter(logging.Formatter('[%(levelname)s] %(filename)s:%(lineno)s: %(message)s'))
    logger.addHandler(handler)
 def load_library(name, cdll):
    if iswindows:
        return cdll.LoadLibrary(name)
    if isosx:
        name += '.dylib'
        if hasattr(sys, 'frameworks_dir'):
            return cdll.LoadLibrary(os.path.join(getattr(sys, 'frameworks_dir'), name))
        return cdll.LoadLibrary(name)
    return cdll.LoadLibrary(name+'.so')
 def extract(path, dir):
    extractor = None
    # First use the file header to identify its type
    with open(path, 'rb') as f:
        id_ = f.read(3)
    if id_ == b'Rar':
        from calibre.utils.unrar import extract as rarextract
        extractor = rarextract
    elif id_.startswith(b'PK'):
        from calibre.libunzip import extract as zipextract
        extractor = zipextract
    if extractor is None:
        # Fallback to file extension
        ext = os.path.splitext(path)[1][1:].lower()
        if ext in ['zip', 'cbz', 'epub', 'oebzip']:
            from calibre.libunzip import extract as zipextract
            extractor = zipextract
        elif ext in ['cbr', 'rar']:
            from calibre.utils.unrar import extract as rarextract
            extractor = rarextract
    if extractor is None:
        raise Exception('Unknown archive type')
    extractor(path, dir)
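The detection order used by `extract` (file header first, extension second) can be shown in isolation. The sketch below only classifies a file and does not unpack anything; `archive_kind` is a hypothetical helper, and the calibre extractors are not used.

```python
import os
import tempfile
import zipfile


def archive_kind(path):
    # Look at the first bytes of the file before trusting its name.
    with open(path, 'rb') as f:
        header = f.read(3)
    if header == b'Rar':
        return 'rar'
    if header.startswith(b'PK'):
        return 'zip'
    # Fall back to the file extension.
    ext = os.path.splitext(path)[1][1:].lower()
    if ext in ('zip', 'cbz', 'epub', 'oebzip'):
        return 'zip'
    if ext in ('cbr', 'rar'):
        return 'rar'
    raise ValueError('Unknown archive type')


with tempfile.TemporaryDirectory() as tmp:
    # A real ZIP archive stored under a misleading extension.
    disguised = os.path.join(tmp, 'book.bin')
    with zipfile.ZipFile(disguised, 'w') as zf:
        zf.writestr('a.txt', 'hi')
    kind_from_header = archive_kind(disguised)

    # Not an archive at all, but named like a comic book archive.
    by_name = os.path.join(tmp, 'comic.cbr')
    with open(by_name, 'wb') as f:
        f.write(b'xyz')
    kind_from_extension = archive_kind(by_name)

print(kind_from_header, kind_from_extension)   # -> zip rar
```

Checking the header first means a renamed archive is still handled correctly; the extension only matters when the header is inconclusive.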
 def get_proxies(debug=True):
    from polyglot.urllib import getproxies
    proxies = getproxies()
    for key, proxy in list(proxies.items()):
        if not proxy or '..' in proxy or key == 'auto':
            del proxies[key]
            continue
        if proxy.startswith(key+'://'):
            proxy = proxy[len(key)+3:]
        if key == 'https' and proxy.startswith('http://'):
            proxy = proxy[7:]
        if proxy.endswith('/'):
            proxy = proxy[:-1]
        if len(proxy) > 4:
            proxies[key] = proxy
        else:
            prints('Removing invalid', key, 'proxy:', proxy)
            del proxies[key]
    if proxies and debug:
        prints('Using proxies:', proxies)
    return proxies
 def get_parsed_proxy(typ='http', debug=True):
    proxies = get_proxies(debug)
    proxy = proxies.get(typ, None)
    if proxy:
        pattern = re.compile((
            '(?:ptype://)?'
            '(?:(?P<user>\\w+):(?P<pass>.*)@)?'
            '(?P<host>[\\w\\-\\.]+)'
            '(?::(?P<port>\\d+))?').replace('ptype', typ)
        )
        match = pattern.match(proxies[typ])
        if match:
            try:
                ans = {
                        'host' : match.group('host'),
                        'port' : match.group('port'),
                        'user' : match.group('user'),
                        'pass' : match.group('pass')
                    }
                if ans['port']:
                    ans['port'] = int(ans['port'])
            except:
                if debug:
                    import traceback
                    traceback.print_exc()
            else:
                if debug:
                    prints('Using http proxy', unicode_type(ans))
                return ans
 def get_proxy_info(proxy_scheme, proxy_string):
    '''
    Parse all proxy information from a proxy string (as returned by
    get_proxies). The returned dict will have members set to None when the info
    is not available in the string. If an exception occurs parsing the string
    this method returns None.
    '''
    from polyglot.urllib import urlparse
    try:
        proxy_url = '%s://%s'%(proxy_scheme, proxy_string)
        urlinfo = urlparse(proxy_url)
        ans = {
            'scheme': urlinfo.scheme,
            'hostname': urlinfo.hostname,
            'port': urlinfo.port,
            'username': urlinfo.username,
            'password': urlinfo.password,
        }
    except Exception:
        return None
    return ans
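A standalone sketch of the parsing done by `get_proxy_info`, assuming that `polyglot.urllib.urlparse` behaves like `urllib.parse.urlparse` on Python 3. The function name `proxy_info_sketch` and the host names are illustrative only.

```python
from urllib.parse import urlparse


def proxy_info_sketch(proxy_scheme, proxy_string):
    try:
        urlinfo = urlparse('%s://%s' % (proxy_scheme, proxy_string))
        return {
            'scheme': urlinfo.scheme,
            'hostname': urlinfo.hostname,
            # Accessing .port raises ValueError for a non-numeric port.
            'port': urlinfo.port,
            'username': urlinfo.username,
            'password': urlinfo.password,
        }
    except Exception:
        return None


full = proxy_info_sketch('http', 'user:secret@proxy.example.com:3128')
bare = proxy_info_sketch('http', 'proxy.example.com')
broken = proxy_info_sketch('http', 'proxy.example.com:notaport')

print(full['hostname'], full['port'], full['username'])
print(bare['port'], bare['username'])   # missing parts are None
print(broken)                           # parsing failed, so None
```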
 # IE 11 on windows 7
 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko'
 USER_AGENT_MOBILE = 'Mozilla/5.0 (Windows; U; Windows CE 5.1; rv:1.8.1a3) Gecko/20060610 Minimo/0.016'
 def is_mobile_ua(ua):
    return 'Mobile/' in ua or 'Mobile ' in ua
 def random_user_agent(choose=None, allow_ie=True):
    from calibre.utils.random_ua import common_user_agents
    ua_list = common_user_agents()
    ua_list = [x for x in ua_list if not is_mobile_ua(x)]
    if not allow_ie:
        ua_list = [x for x in ua_list if 'Trident/' not in x and 'Edge/' not in x]
    return random.choice(ua_list) if choose is None else ua_list[choose]
 def browser(honor_time=True, max_time=2, mobile_browser=False, user_agent=None, verify_ssl_certificates=True, handle_refresh=True):
    '''
    Create a mechanize browser for web scraping. The browser handles cookies,
    refresh requests and ignores robots.txt. Also uses proxy if available.
    :param honor_time: If True, honors the pause time in refresh requests
    :param max_time: Maximum time in seconds to wait during a refresh request
    :param verify_ssl_certificates: If False, SSL certificate errors are ignored
    '''
    from calibre.utils.browser import Browser
    opener = Browser(verify_ssl=verify_ssl_certificates)
    opener.set_handle_refresh(handle_refresh, max_time=max_time, honor_time=honor_time)
    opener.set_handle_robots(False)
    if user_agent is None:
        user_agent = USER_AGENT_MOBILE if mobile_browser else USER_AGENT
    opener.addheaders = [('User-agent', user_agent)]
    proxies = get_proxies()
    to_add = {}
    http_proxy = proxies.get('http', None)
    if http_proxy:
        to_add['http'] = http_proxy
    https_proxy = proxies.get('https', None)
    if https_proxy:
        to_add['https'] = https_proxy
    if to_add:
        opener.set_proxies(to_add)
    return opener
 def fit_image(width, height, pwidth, pheight):
    '''
    Fit image in box of width pwidth and height pheight.
    @param width: Width of image
    @param height: Height of image
    @param pwidth: Width of box
    @param pheight: Height of box
    @return: scaled, new_width, new_height. scaled is True iff new_width and/or new_height is different from width or height.
    '''
    scaled = height > pheight or width > pwidth
    if height > pheight:
        corrf = pheight / float(height)
        width, height = floor(corrf*width), pheight
    if width > pwidth:
        corrf = pwidth / float(width)
        width, height = pwidth, floor(corrf*height)
    if height > pheight:
        corrf = pheight / float(height)
        width, height = floor(corrf*width), pheight
    return scaled, int(width), int(height)
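A worked example of the scaling in `fit_image`, as a standalone copy of the arithmetic for Python 3. The name `fit_image_sketch` is hypothetical; the logic is unchanged.

```python
from math import floor


def fit_image_sketch(width, height, pwidth, pheight):
    scaled = height > pheight or width > pwidth
    # Shrink to the box height first, keeping the aspect ratio.
    if height > pheight:
        corrf = pheight / float(height)
        width, height = floor(corrf * width), pheight
    # Then shrink to the box width if it is still too wide.
    if width > pwidth:
        corrf = pwidth / float(width)
        width, height = pwidth, floor(corrf * height)
    # A final height check, as in the original.
    if height > pheight:
        corrf = pheight / float(height)
        width, height = floor(corrf * width), pheight
    return scaled, int(width), int(height)


# 1600x1200 into an 800x800 box: height pass gives 1066x800,
# width pass then gives 800x600.
print(fit_image_sketch(1600, 1200, 800, 800))   # -> (True, 800, 600)
# An image that already fits is returned unchanged.
print(fit_image_sketch(400, 300, 800, 800))     # -> (False, 400, 300)
```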
 class CurrentDir(object):
    def __init__(self, path):
        self.path = path
        self.cwd = None
    def __enter__(self, *args):
        self.cwd = os.getcwd()
        os.chdir(self.path)
        return self.cwd
    def __exit__(self, *args):
        try:
            os.chdir(self.cwd)
        except EnvironmentError:
            # The previous CWD no longer exists
            pass
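A short usage sketch for `CurrentDir`, using a temporary directory so that it can run anywhere. The class is repeated here only to keep the example self-contained; the variable names are illustrative.

```python
import os
import tempfile


class CurrentDir(object):

    def __init__(self, path):
        self.path = path
        self.cwd = None

    def __enter__(self, *args):
        self.cwd = os.getcwd()
        os.chdir(self.path)
        return self.cwd

    def __exit__(self, *args):
        try:
            os.chdir(self.cwd)
        except EnvironmentError:
            # The previous working directory no longer exists.
            pass


before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    # The context manager yields the directory we came from.
    with CurrentDir(tmp) as previous:
        inside_matches = os.path.samefile(os.getcwd(), tmp)
    after = os.getcwd()

print(previous == before, inside_matches, after == before)
```

The value bound by `as` is the previous working directory, not the new one, which is easy to overlook.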
 _ncpus = None
 if ispy3:
    def detect_ncpus():
        global _ncpus
        if _ncpus is None:
            _ncpus = max(1, os.cpu_count() or 1)
        return _ncpus
 else:
    def detect_ncpus():
        """Detects the number of effective CPUs in the system"""
        global _ncpus
        if _ncpus is None:
            if iswindows:
                import win32api
                ans = win32api.GetSystemInfo()[5]
            else:
                import multiprocessing
                ans = -1
                try:
                    ans = multiprocessing.cpu_count()
                except Exception:
                    from PyQt5.Qt import QThread
                    ans = QThread.idealThreadCount()
            _ncpus = max(1, ans)
        return _ncpus
 relpath = os.path.relpath
 def walk(dir):
    ''' A nice interface to os.walk '''
    for record in os.walk(dir):
        for f in record[-1]:
            yield os.path.join(record[0], f)
 def strftime(fmt, t=None):
    ''' A version of strftime that returns unicode strings and tries to handle dates
    before 1900 '''
    if not fmt:
        return ''
    if t is None:
        t = time.localtime()
    if hasattr(t, 'timetuple'):
        t = t.timetuple()
    early_year = t[0] < 1900
    if early_year:
        replacement = 1900 if t[0]%4 == 0 else 1901
        fmt = fmt.replace('%Y', '_early year hack##')
        t = list(t)
        orig_year = t[0]
        t[0] = replacement
        t = time.struct_time(t)
    ans = None
    if iswindows:
        if isinstance(fmt, bytes):
            fmt = fmt.decode('mbcs', 'replace')
        fmt = fmt.replace('%e', '%#d')
        ans = plugins['winutil'][0].strftime(fmt, t)
    else:
        ans = time.strftime(fmt, t)
        if isinstance(ans, bytes):
            ans = ans.decode(preferred_encoding, 'replace')
    if early_year:
        ans = ans.replace('_early year hack##', unicode_type(orig_year))
    return ans
 def my_unichr(num):
    try:
        return safe_chr(num)
    except (ValueError, OverflowError):
        return '?'
 def entity_to_unicode(match, exceptions=[], encoding='cp1252',
        result_exceptions={}):
    '''
    :param match: A match object such that '&'+match.group(1)+';' is the entity.
    :param exceptions: A list of entities to not convert (each entry is the name of the entity, e.g. 'apos' or '#1234').
    :param encoding: The encoding to use to decode numeric entities between 128 and 256.
    If None, the Unicode UCS encoding is used. A common encoding is cp1252.
    :param result_exceptions: A mapping of characters to entities. If the result
    is in result_exceptions, result_exceptions[result] is returned instead.
    Convenient way to specify exception for things like < or > that can be
    specified by various actual entities.
    '''
    def check(ch):
        return result_exceptions.get(ch, ch)
    ent = match.group(1)
    if ent in exceptions:
        return '&'+ent+';'
    if ent in {'apos', 'squot'}:  # squot is generated by some broken CMS software
        return check("'")
    if ent == 'hellips':
        ent = 'hellip'
    if ent.startswith('#'):
        try:
            if ent[1] in ('x', 'X'):
                num = int(ent[2:], 16)
            else:
                num = int(ent[1:])
        except:
            return '&'+ent+';'
        if encoding is None or num > 255:
            return check(my_unichr(num))
        try:
            return check(bytes(bytearray((num,))).decode(encoding))
        except UnicodeDecodeError:
            return check(my_unichr(num))
    from calibre.ebooks.html_entities import html5_entities
    try:
        return check(html5_entities[ent])
    except KeyError:
        pass
    from polyglot.html_entities import name2codepoint
    try:
        return check(my_unichr(name2codepoint[ent]))
    except KeyError:
        return '&'+ent+';'
 _ent_pat = re.compile(r'&(\S+?);')
 xml_entity_to_unicode = partial(entity_to_unicode, result_exceptions={
    '"' : '&quot;',
    "'" : '&apos;',
    '<' : '&lt;',
    '>' : '&gt;',
    '&' : '&amp;'})
 def replace_entities(raw, encoding='cp1252'):
    return _ent_pat.sub(partial(entity_to_unicode, encoding=encoding), raw)
 def xml_replace_entities(raw, encoding='cp1252'):
    return _ent_pat.sub(partial(xml_entity_to_unicode, encoding=encoding), raw)
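The core idea of `replace_entities` is a regex substitution whose callback decodes each entity. The sketch below is a reduced version that uses the standard library's `html.entities.name2codepoint` in place of the calibre and polyglot tables; `ENTITY`, `decode_entity` and `replace_entities_sketch` are hypothetical names, and the `exceptions` and `result_exceptions` handling is omitted.

```python
import re
from html.entities import name2codepoint

ENTITY = re.compile(r'&(\S+?);')


def decode_entity(match, encoding='cp1252'):
    ent = match.group(1)
    raw = '&' + ent + ';'
    if ent.startswith('#'):
        try:
            num = int(ent[2:], 16) if ent[1] in ('x', 'X') else int(ent[1:])
        except (ValueError, IndexError):
            return raw  # not a number: leave the text untouched
        if not 0 <= num <= 0x10FFFF:
            return raw  # simplification of this sketch for out-of-range values
        if encoding is None or num > 255:
            return chr(num)
        try:
            # Values up to 255 are read as a byte in the legacy encoding.
            return bytes((num,)).decode(encoding)
        except UnicodeDecodeError:
            return chr(num)
    try:
        return chr(name2codepoint[ent])
    except KeyError:
        return raw  # unknown named entity


def replace_entities_sketch(raw):
    return ENTITY.sub(decode_entity, raw)


text = replace_entities_sketch('caf&#233; &amp; &#128; &#x20AC; &bogus;')
print(text)
```

`&#128;` and `&#x20AC;` both come out as the euro sign: the first through the cp1252 byte interpretation, the second as a direct code point. Unknown entities such as `&bogus;` are passed through unchanged.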
 def prepare_string_for_xml(raw, attribute=False):
    raw = _ent_pat.sub(entity_to_unicode, raw)
    raw = raw.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
    if attribute:
        raw = raw.replace('"', '&quot;').replace("'", '&apos;')
    return raw
 def isbytestring(obj):
    return isinstance(obj, bytes)
 def force_unicode(obj, enc=preferred_encoding):
    if isbytestring(obj):
        try:
            obj = obj.decode(enc)
        except Exception:
            try:
                obj = obj.decode(filesystem_encoding if enc ==
                        preferred_encoding else preferred_encoding)
            except Exception:
                try:
                    obj = obj.decode('utf-8')
                except Exception:
                    obj = repr(obj)
                    if isbytestring(obj):
                        obj = obj.decode('utf-8')
    return obj
 def as_unicode(obj, enc=preferred_encoding):
    if not isbytestring(obj):
        try:
            obj = unicode_type(obj)
        except Exception:
            try:
                obj = native_string_type(obj)
            except Exception:
                obj = repr(obj)
    return force_unicode(obj, enc=enc)
 def url_slash_cleaner(url):
    '''
    Removes redundant slashes from URLs.
    '''
    return re.sub(r'(?<!:)/{2,}', '/', url)
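A quick demonstration of what the negative lookbehind in `url_slash_cleaner` does. The function is repeated verbatim so the snippet runs on its own; the URLs are made up.

```python
import re


def url_slash_cleaner(url):
    # Collapse runs of two or more slashes, except a run that
    # directly follows a colon (the '//' after the scheme).
    return re.sub(r'(?<!:)/{2,}', '/', url)


print(url_slash_cleaner('http://example.com//a///b'))   # -> http://example.com/a/b
print(url_slash_cleaner('/local//path'))                # -> /local/path
```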
 def human_readable(size, sep=' '):
    """ Convert a size in bytes into a human readable form """
    divisor, suffix = 1, "B"
    for i, candidate in enumerate(('B', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB')):
        if size < (1 << ((i + 1) * 10)):
            divisor, suffix = (1 << (i * 10)), candidate
            break
    size = unicode_type(float(size)/divisor)
    if size.find(".") > -1:
        size = size[:size.find(".")+2]
    if size.endswith('.0'):
        size = size[:-2]
    return size + sep + suffix
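A standalone copy of `human_readable` for Python 3, with `str` standing in for `unicode_type`, to show the results for a few sizes. The name `human_readable_sketch` is hypothetical.

```python
def human_readable_sketch(size, sep=' '):
    divisor, suffix = 1, 'B'
    for i, candidate in enumerate(('B', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB')):
        # Pick the first unit whose next power of 1024 exceeds the size.
        if size < (1 << ((i + 1) * 10)):
            divisor, suffix = (1 << (i * 10)), candidate
            break
    size = str(float(size) / divisor)
    # Keep one digit after the decimal point (truncated, not rounded).
    if size.find('.') > -1:
        size = size[:size.find('.') + 2]
    if size.endswith('.0'):
        size = size[:-2]
    return size + sep + suffix


print(human_readable_sketch(500))       # -> 500 B
print(human_readable_sketch(1024))      # -> 1 KB
print(human_readable_sketch(1536))      # -> 1.5 KB
print(human_readable_sketch(1234567))   # -> 1.1 MB
```

The last case shows the truncation: 1234567 bytes is about 1.177 MB, which is reported as `1.1 MB` rather than rounded to `1.2 MB`.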
 def ipython(user_ns=None):
    from calibre.utils.ipython import ipython
    ipython(user_ns=user_ns)
 def fsync(fileobj):
    fileobj.flush()
    os.fsync(fileobj.fileno())
    if islinux and getattr(fileobj, 'name', None):
        # On Linux kernels after 5.1.9 and 4.19.50 using fsync without any
        # following activity causes Kindles to eject. Instead of fixing this in
        # the obvious way, which is to have the kernel send some harmless
        # filesystem activity after the FSYNC, the kernel developers seem to
        # think the correct solution is to disable FSYNC using a mount flag
        # which users will have to turn on manually. So instead we create some
        # harmless filesystem activity, and who cares about performance.
        # See https://bugs.launchpad.net/calibre/+bug/1834641
        # and https://bugzilla.kernel.org/show_bug.cgi?id=203973
        # To check for the existence of the bug, simply run:
        # python -c "import os; p = '/run/media/kovid/Kindle/driveinfo.calibre'; f = open(p, 'r+b'); os.fsync(f.fileno());"
        # this will cause the Kindle to disconnect.
        try:
            os.utime(fileobj.name, None)
        except Exception:
            import traceback
            traceback.print_exc()
--- a/ebook_converter/constants.py
+++ b/ebook_converter/constants.py
@@ -0,0 +1,343 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 # License: GPLv3 Copyright: 2015, Kovid Goyal <kovid at kovidgoyal.net>
 from __future__ import print_function, unicode_literals
 from polyglot.builtins import map, unicode_type, environ_item, hasenv, getenv, as_unicode, native_string_type
 import sys, locale, codecs, os, importlib, collections
 __appname__   = 'calibre'
 numeric_version = (4, 12, 0)
 __version__   = '.'.join(map(unicode_type, numeric_version))
 git_version   = None
 __author__    = "Kovid Goyal <kovid@kovidgoyal.net>"
 '''
 Various run time constants.
 '''
 _plat = sys.platform.lower()
 iswindows = 'win32' in _plat or 'win64' in _plat
 isosx     = 'darwin' in _plat
 isnewosx  = isosx and getattr(sys, 'new_app_bundle', False)
 isfreebsd = 'freebsd' in _plat
 isnetbsd = 'netbsd' in _plat
 isdragonflybsd = 'dragonfly' in _plat
 isbsd = isfreebsd or isnetbsd or isdragonflybsd
 ishaiku = 'haiku1' in _plat
 islinux   = not(iswindows or isosx or isbsd or ishaiku)
 isfrozen  = hasattr(sys, 'frozen')
 isunix = isosx or islinux or ishaiku
 isportable = hasenv('CALIBRE_PORTABLE_BUILD')
 ispy3 = sys.version_info.major > 2
 isxp = isoldvista = False
 if iswindows:
    wver = sys.getwindowsversion()
    isxp = wver.major < 6
    isoldvista = wver.build < 6002
 is64bit = sys.maxsize > (1 << 32)
 isworker = hasenv('CALIBRE_WORKER') or hasenv('CALIBRE_SIMPLE_WORKER')
 if isworker:
    os.environ.pop(environ_item('CALIBRE_FORCE_ANSI'), None)
 FAKE_PROTOCOL, FAKE_HOST = 'clbr', 'internal.invalid'
 VIEWER_APP_UID = 'com.calibre-ebook.viewer'
 EDITOR_APP_UID = 'com.calibre-ebook.edit-book'
 MAIN_APP_UID = 'com.calibre-ebook.main-gui'
 STORE_DIALOG_APP_UID = 'com.calibre-ebook.store-dialog'
 TOC_DIALOG_APP_UID = 'com.calibre-ebook.toc-editor'
 try:
    preferred_encoding = locale.getpreferredencoding()
    codecs.lookup(preferred_encoding)
 except:
    preferred_encoding = 'utf-8'
 win32event = importlib.import_module('win32event') if iswindows else None
 winerror   = importlib.import_module('winerror') if iswindows else None
 win32api   = importlib.import_module('win32api') if iswindows else None
 fcntl      = None if iswindows else importlib.import_module('fcntl')
 dark_link_color = '#6cb4ee'
 _osx_ver = None
 def get_osx_version():
    global _osx_ver
    if _osx_ver is None:
        import platform
        from collections import namedtuple
        OSX = namedtuple('OSX', 'major minor tertiary')
        try:
            ver = platform.mac_ver()[0].split('.')
            if len(ver) == 2:
                ver.append(0)
            _osx_ver = OSX(*map(int, ver))  # no2to3
        except Exception:
            _osx_ver = OSX(0, 0, 0)
    return _osx_ver
 filesystem_encoding = sys.getfilesystemencoding()
 if filesystem_encoding is None:
    filesystem_encoding = 'utf-8'
 else:
    try:
        if codecs.lookup(filesystem_encoding).name == 'ascii':
            filesystem_encoding = 'utf-8'
            # On linux, unicode arguments to os file functions are coerced to an ascii
            # bytestring if sys.getfilesystemencoding() == 'ascii', which is
            # just plain dumb. This is fixed by the icu.py module which, when
            # imported changes ascii to utf-8
    except Exception:
        filesystem_encoding = 'utf-8'
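The fallback rule for `filesystem_encoding` can be expressed as a small pure function, which makes it easier to check. This is a sketch only; `effective_fs_encoding` is a hypothetical name and takes the reported encoding as an argument instead of calling `sys.getfilesystemencoding()`.

```python
import codecs


def effective_fs_encoding(reported):
    # No encoding reported at all: assume UTF-8.
    if reported is None:
        return 'utf-8'
    try:
        # Any alias of ASCII is upgraded to UTF-8.
        if codecs.lookup(reported).name == 'ascii':
            return 'utf-8'
    except Exception:
        # Unknown codec name: assume UTF-8 as well.
        return 'utf-8'
    return reported


print(effective_fs_encoding('US-ASCII'))        # -> utf-8
print(effective_fs_encoding('latin-1'))         # -> latin-1
print(effective_fs_encoding('no-such-codec'))   # -> utf-8
```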
 DEBUG = hasenv('CALIBRE_DEBUG')
 def debug():
    global DEBUG
    DEBUG = True
 def _get_cache_dir():
    import errno
    confcache = os.path.join(config_dir, 'caches')
    try:
        os.makedirs(confcache)
    except EnvironmentError as err:
        if err.errno != errno.EEXIST:
            raise
    if isportable:
        return confcache
    ccd = getenv('CALIBRE_CACHE_DIRECTORY')
    if ccd is not None:
        ans = os.path.abspath(ccd)
        try:
            os.makedirs(ans)
            return ans
        except EnvironmentError as err:
            if err.errno == errno.EEXIST:
                return ans
    if iswindows:
        w = plugins['winutil'][0]
        try:
            candidate = os.path.join(w.special_folder_path(w.CSIDL_LOCAL_APPDATA), '%s-cache'%__appname__)
        except ValueError:
            return confcache
    elif isosx:
        candidate = os.path.join(os.path.expanduser('~/Library/Caches'), __appname__)
    else:
        candidate = getenv('XDG_CACHE_HOME', '~/.cache')
        candidate = os.path.join(os.path.expanduser(candidate),
                                    __appname__)
        if isinstance(candidate, bytes):
            try:
                candidate = candidate.decode(filesystem_encoding)
            except ValueError:
                candidate = confcache
    try:
        os.makedirs(candidate)
    except EnvironmentError as err:
        if err.errno != errno.EEXIST:
            candidate = confcache
    return candidate
 def cache_dir():
    ans = getattr(cache_dir, 'ans', None)
    if ans is None:
        ans = cache_dir.ans = os.path.realpath(_get_cache_dir())
    return ans
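`cache_dir` memoizes its result on the function object itself. A minimal sketch of that pattern follows, with a hypothetical `_compute` standing in for `_get_cache_dir` and a made-up path.

```python
calls = []


def _compute():
    # Stand-in for the expensive lookup; records each invocation.
    calls.append(1)
    return '/tmp/example-cache'


def cache_dir_sketch():
    # The result is stored as an attribute of the function.
    ans = getattr(cache_dir_sketch, 'ans', None)
    if ans is None:
        ans = cache_dir_sketch.ans = _compute()
    return ans


first = cache_dir_sketch()
second = cache_dir_sketch()
print(first == second, len(calls))   # -> True 1
```

The directory is therefore resolved once per process; later changes to the environment variables it depends on would not be picked up.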
 plugins_loc = sys.extensions_location
 if ispy3:
    plugins_loc = os.path.join(plugins_loc, '3')
 # plugins {{{
 class Plugins(collections.Mapping):
    def __init__(self):
        self._plugins = {}
        plugins = [
                'pictureflow',
                'lzx',
                'msdes',
                'podofo',
                'cPalmdoc',
                'progress_indicator',
                'chmlib',
                'icu',
                'speedup',
                'html_as_json',
                'unicode_names',
                'html_syntax_highlighter',
                'hyphen',
                'freetype',
                'imageops',
                'hunspell',
                '_patiencediff_c',
                'bzzdec',
                'matcher',
                'tokenizer',
                'certgen',
                'lzma_binding',
            ]
        if not ispy3:
            plugins.extend([
                'monotonic',
                'zlib2',
            ])
        if iswindows:
            plugins.extend(['winutil', 'wpd', 'winfonts'])
        if isosx:
            plugins.append('usbobserver')
            plugins.append('cocoa')
        if isfreebsd or ishaiku or islinux or isosx:
            plugins.append('libusb')
            plugins.append('libmtp')
        self.plugins = frozenset(plugins)
    def load_plugin(self, name):
        if name in self._plugins:
            return
        sys.path.insert(0, plugins_loc)
        try:
            del sys.modules[name]
        except KeyError:
            pass
        plugin_err = ''
        try:
            p = importlib.import_module(name)
        except Exception as err:
            p = None
            try:
                plugin_err = unicode_type(err)
            except Exception:
                plugin_err = as_unicode(native_string_type(err), encoding=preferred_encoding, errors='replace')
        self._plugins[name] = p, plugin_err
        sys.path.remove(plugins_loc)
    def __iter__(self):
        return iter(self.plugins)
    def __len__(self):
        return len(self.plugins)
    def __contains__(self, name):
        return name in self.plugins
    def __getitem__(self, name):
        if name not in self.plugins:
            raise KeyError('No plugin named %r'%name)
        self.load_plugin(name)
        return self._plugins[name]
 plugins = None
 if plugins is None:
    plugins = Plugins()
 # }}}
 # config_dir {{{
 CONFIG_DIR_MODE = 0o700
 cconfd = getenv('CALIBRE_CONFIG_DIRECTORY')
 if cconfd is not None:
    config_dir = os.path.abspath(cconfd)
 elif iswindows:
    if plugins['winutil'][0] is None:
        raise Exception(plugins['winutil'][1])
    try:
        config_dir = plugins['winutil'][0].special_folder_path(plugins['winutil'][0].CSIDL_APPDATA)
    except ValueError:
        config_dir = None
    if not config_dir or not os.access(config_dir, os.W_OK|os.X_OK):
        config_dir = os.path.expanduser('~')
    config_dir = os.path.join(config_dir, 'calibre')
 elif isosx:
    config_dir = os.path.expanduser('~/Library/Preferences/calibre')
 else:
    bdir = os.path.abspath(os.path.expanduser(getenv('XDG_CONFIG_HOME', '~/.config')))
    config_dir = os.path.join(bdir, 'calibre')
    try:
        os.makedirs(config_dir, mode=CONFIG_DIR_MODE)
    except EnvironmentError:
        pass
    if not os.path.exists(config_dir) or \
            not os.access(config_dir, os.W_OK) or not \
            os.access(config_dir, os.X_OK):
        print('No write access to', config_dir, 'using a temporary dir instead')
        import tempfile, atexit
        config_dir = tempfile.mkdtemp(prefix='calibre-config-')
        def cleanup_cdir():
            try:
                import shutil
                shutil.rmtree(config_dir)
            except Exception:
                pass
        atexit.register(cleanup_cdir)
 # }}}
 dv = getenv('CALIBRE_DEVELOP_FROM')
 is_running_from_develop = bool(getattr(sys, 'frozen', False) and dv and os.path.abspath(dv) in sys.path)
 del dv
 def get_version():
    '''Return version string for display to user '''
    if git_version is not None:
        v = git_version
    else:
        v = __version__
        if numeric_version[-1] == 0:
            v = v[:-2]
    if is_running_from_develop:
        v += '*'
    if iswindows and is64bit:
        v += ' [64bit]'
    return v
 def get_portable_base():
    'Return path to the directory that contains calibre-portable.exe or None'
    if isportable:
        return os.path.dirname(os.path.dirname(getenv('CALIBRE_PORTABLE_BUILD')))
 def get_windows_username():
    '''
    Return the user name of the currently logged in user as a unicode string.
    Note that usernames on Windows are case insensitive; the case of the value
    returned depends on what the user typed into the login box at login time.
    '''
    username = plugins['winutil'][0].username
    return username()
 def get_windows_temp_path():
    temp_path = plugins['winutil'][0].temp_path
    return temp_path()
 def get_windows_user_locale_name():
    locale_name = plugins['winutil'][0].locale_name
    return locale_name()
 def get_windows_number_formats():
    ans = getattr(get_windows_number_formats, 'ans', None)
    if ans is None:
        localeconv = plugins['winutil'][0].localeconv
        d = localeconv()
        thousands_sep, decimal_point = d['thousands_sep'], d['decimal_point']
        ans = get_windows_number_formats.ans = thousands_sep, decimal_point
    return ans
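Review note: the non-Windows, non-macOS branch of `_get_cache_dir()` above resolves the cache directory from `XDG_CACHE_HOME`, tolerates an already existing directory, and falls back to another location on any other error. A minimal standalone sketch of that pattern follows. The function name `resolve_cache_dir` and its `appname`, `fallback` and `environ` parameters are assumptions made for illustration, not names from this commit.

```python
import errno
import os


def resolve_cache_dir(appname, fallback, environ=None):
    """Return a per-application cache directory, creating it if needed.

    Hypothetical helper: all parameter names are illustrative only.
    """
    environ = os.environ if environ is None else environ
    # Use XDG_CACHE_HOME when set, otherwise the conventional ~/.cache
    base = environ.get('XDG_CACHE_HOME', '~/.cache')
    candidate = os.path.join(os.path.expanduser(base), appname)
    try:
        os.makedirs(candidate)
    except EnvironmentError as err:
        # An already existing directory is fine; any other error means the
        # location is unusable, so return the fallback instead
        if err.errno != errno.EEXIST:
            return fallback
    return candidate
```

Passing the environment as a mapping keeps the sketch easy to exercise without touching the real process environment.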
--- a/ebook_converter/css_selectors/__init__.py
+++ b/ebook_converter/css_selectors/__init__.py
@@ -0,0 +1,12 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
 from css_selectors.parser import parse
 from css_selectors.select import Select, INAPPROPRIATE_PSEUDO_CLASSES
 from css_selectors.errors import SelectorError, SelectorSyntaxError, ExpressionError
 __all__ = ['parse', 'Select', 'INAPPROPRIATE_PSEUDO_CLASSES', 'SelectorError', 'SelectorSyntaxError', 'ExpressionError']
--- a/ebook_converter/css_selectors/errors.py
+++ b/ebook_converter/css_selectors/errors.py
@@ -0,0 +1,18 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
 class SelectorError(ValueError):
    """Common parent for SelectorSyntaxError and ExpressionError"""
 class SelectorSyntaxError(SelectorError):
    """Parsing a selector that does not match the grammar."""
 class ExpressionError(SelectorError):
    """Unknown or unsupported selector (eg. pseudo-class)."""
--- a/ebook_converter/css_selectors/ordered_set.py
+++ b/ebook_converter/css_selectors/ordered_set.py
@@ -0,0 +1,133 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
 import collections
 from polyglot.builtins import string_or_bytes
 SLICE_ALL = slice(None)
 def is_iterable(obj):
    """
    Are we being asked to look up a list of things, instead of a single thing?
    We check for the `__iter__` attribute so that this can cover types that
    don't have to be known by this module, such as NumPy arrays.
    Strings, however, should be considered as atomic values to look up, not
    iterables.
    """
    return hasattr(obj, '__iter__') and not isinstance(obj, string_or_bytes)
 class OrderedSet(collections.MutableSet):
    """
    An OrderedSet is a custom MutableSet that remembers its order, so that
    every entry has an index that can be looked up.
    """
    def __init__(self, iterable=None):
        self.items = []
        self.map = {}
        if iterable is not None:
            for item in iterable:
                idx = self.map.get(item)
                if idx is None:
                    self.map[item] = len(self.items)
                    self.items.append(item)
    def __len__(self):
        return len(self.items)
    def __getitem__(self, index):
        """
        Get the item at a given index.
        If `index` is a slice, you will get back that slice of items. If it's
        the slice [:], exactly the same object is returned. (If you want an
        independent copy of an OrderedSet, use `OrderedSet.copy()`.)
        If `index` is an iterable, you'll get the OrderedSet of items
        corresponding to those indices. This is similar to NumPy's
        "fancy indexing".
        """
        if index == SLICE_ALL:
            return self
        elif hasattr(index, '__index__') or isinstance(index, slice):
            result = self.items[index]
            if isinstance(result, list):
                return OrderedSet(result)
            else:
                return result
        elif is_iterable(index):
            return OrderedSet([self.items[i] for i in index])
        else:
            raise TypeError("Don't know how to index an OrderedSet by %r" %
                    index)
    def copy(self):
        return OrderedSet(self)
    def __getstate__(self):
        return tuple(self)
    def __setstate__(self, state):
        self.__init__(state)
    def __contains__(self, key):
        return key in self.map
    def add(self, key):
        """
        Add `key` as an item to this OrderedSet, then return its index.
        If `key` is already in the OrderedSet, return the index it already
        had.
        """
        index = self.map.get(key)
        if index is None:
            self.map[key] = index = len(self.items)
            self.items.append(key)
        return index
    def index(self, key):
        """
        Get the index of a given entry, raising a KeyError if it's not
        present.
        `key` can be an iterable of entries that is not a string, in which case
        this returns a list of indices.
        """
        if is_iterable(key):
            return [self.index(subkey) for subkey in key]
        return self.map[key]
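    def discard(self, key):
        # Pop the key from the index map as well, otherwise __contains__
        # would still report it as present after removal
        index = self.map.pop(key, None)
        if index is not None:
            self.items.pop(index)
            # Every item after the removed one shifts down by one position
            for item in self.items[index:]:
                self.map[item] -= 1
            return True
        return False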
    def __iter__(self):
        return iter(self.items)
    def __reversed__(self):
        return reversed(self.items)
    def __repr__(self):
        if not self:
            return '%s()' % (self.__class__.__name__,)
        return '%s(%r)' % (self.__class__.__name__, list(self))
    def __eq__(self, other):
        if isinstance(other, OrderedSet):
            return len(self) == len(other) and self.items == other.items
        try:
            return type(other)(self.map) == other
        except TypeError:
            return False
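Review note: `OrderedSet` pairs a list (insertion order) with a dict (entry to position). The sketch below isolates that bookkeeping in a reduced, hypothetical class named `TinyOrderedSet`; it is not the class from this file. In this sketch, `discard` removes the entry from both structures and renumbers the entries that followed it, so that membership tests and indices stay consistent.

```python
class TinyOrderedSet(object):
    """Illustrative stand-in showing the list + dict bookkeeping only."""

    def __init__(self, iterable=()):
        self.items = []  # entries in insertion order
        self.map = {}    # entry -> position in self.items
        for item in iterable:
            self.add(item)

    def add(self, key):
        # Return the existing index, or append and return the new one
        index = self.map.get(key)
        if index is None:
            self.map[key] = index = len(self.items)
            self.items.append(key)
        return index

    def discard(self, key):
        # Remove from the map first so the key is no longer "contained"
        index = self.map.pop(key, None)
        if index is None:
            return False
        self.items.pop(index)
        # Entries after the removed one move down by one position
        for item in self.items[index:]:
            self.map[item] -= 1
        return True

    def __contains__(self, key):
        return key in self.map
```

Removal costs time proportional to the number of later entries, which is the price of keeping a dense index for every entry.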
--- a/ebook_converter/css_selectors/parser.py
+++ b/ebook_converter/css_selectors/parser.py
@@ -0,0 +1,791 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 """
    Tokenizer, parser and parsed objects for CSS selectors.
    :copyright: (c) 2007-2012 Ian Bicking and contributors.
                See AUTHORS for more details.
    :license: BSD, see LICENSE for more details.
 """
 import sys
 import re
 import operator
 import string
 from css_selectors.errors import SelectorSyntaxError, ExpressionError
 from polyglot.builtins import unicode_type, codepoint_to_chr, range
 utab = {c:c+32 for c in range(ord(u'A'), ord(u'Z')+1)}
 if sys.version_info.major < 3:
    tab = string.maketrans(string.ascii_uppercase, string.ascii_lowercase)
    def ascii_lower(string):
        """Lower-case, but only in the ASCII range."""
        return string.translate(utab if isinstance(string, unicode_type) else tab)
    def urepr(x):
        if isinstance(x, list):
            return '[%s]' % ', '.join((map(urepr, x)))
        ans = repr(x)
        if ans.startswith("u'") or ans.startswith('u"'):
            ans = ans[1:]
        return ans
 else:
    def ascii_lower(x):
        return x.translate(utab)
    urepr = repr
 # Parsed objects
 class Selector(object):
    """
    Represents a parsed selector.
    """
    def __init__(self, tree, pseudo_element=None):
        self.parsed_tree = tree
        if pseudo_element is not None and not isinstance(
                pseudo_element, FunctionalPseudoElement):
            pseudo_element = ascii_lower(pseudo_element)
        #: A :class:`FunctionalPseudoElement`,
        #: or the identifier for the pseudo-element as a string,
        #: or ``None``.
        #:
        #: +-------------------------+----------------+--------------------------------+
        #: |                         | Selector       | Pseudo-element                 |
        #: +=========================+================+================================+
        #: | CSS3 syntax             | ``a::before``  | ``'before'``                   |
        #: +-------------------------+----------------+--------------------------------+
        #: | Older syntax            | ``a:before``   | ``'before'``                   |
        #: +-------------------------+----------------+--------------------------------+
        #: | From the Lists3_ draft, | ``li::marker`` | ``'marker'``                   |
        #: | not in Selectors3       |                |                                |
        #: +-------------------------+----------------+--------------------------------+
        #: | Invalid pseudo-class    | ``li:marker``  | ``None``                       |
        #: +-------------------------+----------------+--------------------------------+
        #: | Functional              | ``a::foo(2)``  | ``FunctionalPseudoElement(…)`` |
        #: +-------------------------+----------------+--------------------------------+
        #:
        #: .. _Lists3: http://www.w3.org/TR/2011/WD-css3-lists-20110524/#marker-pseudoelement
        self.pseudo_element = pseudo_element
    def __repr__(self):
        if isinstance(self.pseudo_element, FunctionalPseudoElement):
            pseudo_element = repr(self.pseudo_element)
        elif self.pseudo_element:
            pseudo_element = '::%s' % self.pseudo_element
        else:
            pseudo_element = ''
        return '%s[%r%s]' % (
            self.__class__.__name__, self.parsed_tree, pseudo_element)
    def specificity(self):
        """Return the specificity_ of this selector as a tuple of 3 integers.
        .. _specificity: http://www.w3.org/TR/selectors/#specificity
        """
        a, b, c = self.parsed_tree.specificity()
        if self.pseudo_element:
            c += 1
        return a, b, c
 class Class(object):
    """
    Represents selector.class_name
    """
    def __init__(self, selector, class_name):
        self.selector = selector
        self.class_name = class_name
    def __repr__(self):
        return '%s[%r.%s]' % (
            self.__class__.__name__, self.selector, self.class_name)
    def specificity(self):
        a, b, c = self.selector.specificity()
        b += 1
        return a, b, c
 class FunctionalPseudoElement(object):
    """
    Represents selector::name(arguments)
    .. attribute:: name
        The name (identifier) of the pseudo-element, as a string.
    .. attribute:: arguments
        The arguments of the pseudo-element, as a list of tokens.
        **Note:** tokens are not part of the public API,
        and may change between versions.
        Use at your own risk.
    """
    def __init__(self, name, arguments):
        self.name = ascii_lower(name)
        self.arguments = arguments
    def __repr__(self):
        return '%s[::%s(%s)]' % (
            self.__class__.__name__, self.name,
            urepr([token.value for token in self.arguments]))
    def argument_types(self):
        return [token.type for token in self.arguments]
    def specificity(self):
        a, b, c = self.selector.specificity()
        b += 1
        return a, b, c
 class Function(object):
    """
    Represents selector:name(expr)
    """
    def __init__(self, selector, name, arguments):
        self.selector = selector
        self.name = ascii_lower(name)
        self.arguments = arguments
        self._parsed_arguments = None
    def __repr__(self):
        return '%s[%r:%s(%s)]' % (
            self.__class__.__name__, self.selector, self.name,
            urepr([token.value for token in self.arguments]))
    def argument_types(self):
        return [token.type for token in self.arguments]
    @property
    def parsed_arguments(self):
        if self._parsed_arguments is None:
            try:
                self._parsed_arguments = parse_series(self.arguments)
            except ValueError:
                raise ExpressionError("Invalid series: '%r'" % self.arguments)
        return self._parsed_arguments
    def parse_arguments(self):
        if not self.arguments_parsed:
            self.arguments_parsed = True
    def specificity(self):
        a, b, c = self.selector.specificity()
        b += 1
        return a, b, c
 class Pseudo(object):
    """
    Represents selector:ident
    """
    def __init__(self, selector, ident):
        self.selector = selector
        self.ident = ascii_lower(ident)
    def __repr__(self):
        return '%s[%r:%s]' % (
            self.__class__.__name__, self.selector, self.ident)
    def specificity(self):
        a, b, c = self.selector.specificity()
        b += 1
        return a, b, c
 class Negation(object):
    """
    Represents selector:not(subselector)
    """
    def __init__(self, selector, subselector):
        self.selector = selector
        self.subselector = subselector
    def __repr__(self):
        return '%s[%r:not(%r)]' % (
            self.__class__.__name__, self.selector, self.subselector)
    def specificity(self):
        a1, b1, c1 = self.selector.specificity()
        a2, b2, c2 = self.subselector.specificity()
        return a1 + a2, b1 + b2, c1 + c2
 class Attrib(object):
    """
    Represents selector[namespace|attrib operator value]
    """
    def __init__(self, selector, namespace, attrib, operator, value):
        self.selector = selector
        self.namespace = namespace
        self.attrib = attrib
        self.operator = operator
        self.value = value
    def __repr__(self):
        if self.namespace:
            attrib = '%s|%s' % (self.namespace, self.attrib)
        else:
            attrib = self.attrib
        if self.operator == 'exists':
            return '%s[%r[%s]]' % (
                self.__class__.__name__, self.selector, attrib)
        else:
            return '%s[%r[%s %s %s]]' % (
                self.__class__.__name__, self.selector, attrib,
                self.operator, urepr(self.value))
    def specificity(self):
        a, b, c = self.selector.specificity()
        b += 1
        return a, b, c
 class Element(object):
    """
    Represents namespace|element
    `None` is for the universal selector '*'
    """
    def __init__(self, namespace=None, element=None):
        self.namespace = namespace
        self.element = element
    def __repr__(self):
        element = self.element or '*'
        if self.namespace:
            element = '%s|%s' % (self.namespace, element)
        return '%s[%s]' % (self.__class__.__name__, element)
    def specificity(self):
        if self.element:
            return 0, 0, 1
        else:
            return 0, 0, 0
 class Hash(object):
    """
    Represents selector#id
    """
    def __init__(self, selector, id):
        self.selector = selector
        self.id = id
    def __repr__(self):
        return '%s[%r#%s]' % (
            self.__class__.__name__, self.selector, self.id)
    def specificity(self):
        a, b, c = self.selector.specificity()
        a += 1
        return a, b, c
 class CombinedSelector(object):
    def __init__(self, selector, combinator, subselector):
        assert selector is not None
        self.selector = selector
        self.combinator = combinator
        self.subselector = subselector
    def __repr__(self):
        if self.combinator == ' ':
            comb = '<followed>'
        else:
            comb = self.combinator
        return '%s[%r %s %r]' % (
            self.__class__.__name__, self.selector, comb, self.subselector)
    def specificity(self):
        a1, b1, c1 = self.selector.specificity()
        a2, b2, c2 = self.subselector.specificity()
        return a1 + a2, b1 + b2, c1 + c2
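Review note: the `specificity()` methods above build an `(a, b, c)` tuple. `Hash` increments `a`; `Class`, `Attrib`, `Pseudo` and `Function` increment `b`; a named `Element` contributes to `c`; and `CombinedSelector` sums both sides. The sketch below reproduces only that arithmetic. The helper names `simple` and `combine` are assumptions for illustration and do not exist in the module.

```python
def simple(ids=0, classes=0, elements=0):
    """Specificity of one compound selector as an (a, b, c) tuple."""
    return (ids, classes, elements)


def combine(left, right):
    """Specificity of two selectors joined by a combinator."""
    return tuple(x + y for x, y in zip(left, right))


# "div p.note": two element names and one class
div_p_note = combine(simple(elements=1), simple(classes=1, elements=1))
# "#main": a single id
main_id = simple(ids=1)

# Tuples compare lexicographically, so one id outranks any number of
# classes or element names
winner = max(div_p_note, main_id)
```

Because Python compares tuples element by element, no extra weighting scheme is needed to order selectors.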
 # Parser
 # foo
 _el_re = re.compile(r'^[ \t\r\n\f]*([a-zA-Z]+)[ \t\r\n\f]*$')
 # foo#bar or #bar
 _id_re = re.compile(r'^[ \t\r\n\f]*([a-zA-Z]*)#([a-zA-Z0-9_-]+)[ \t\r\n\f]*$')
 # foo.bar or .bar
 _class_re = re.compile(
    r'^[ \t\r\n\f]*([a-zA-Z]*)\.([a-zA-Z][a-zA-Z0-9_-]*)[ \t\r\n\f]*$')
 def parse(css):
    """Parse a CSS *group of selectors*.
    :param css:
        A *group of selectors* as a Unicode string.
    :raises:
        :class:`SelectorSyntaxError` on invalid selectors.
    :returns:
        A list of parsed :class:`Selector` objects, one for each
        selector in the comma-separated group.
    """
    # Fast path for simple cases
    match = _el_re.match(css)
    if match:
        return [Selector(Element(element=match.group(1)))]
    match = _id_re.match(css)
    if match is not None:
        return [Selector(Hash(Element(element=match.group(1) or None),
                              match.group(2)))]
    match = _class_re.match(css)
    if match is not None:
        return [Selector(Class(Element(element=match.group(1) or None),
                               match.group(2)))]
    stream = TokenStream(tokenize(css))
    stream.source = css
    return list(parse_selector_group(stream))
 #    except SelectorSyntaxError:
 #        e = sys.exc_info()[1]
 #        message = "%s at %s -> %r" % (
 #            e, stream.used, stream.peek())
 #        e.msg = message
 #        e.args = tuple([message])
 #        raise
 def parse_selector_group(stream):
    stream.skip_whitespace()
    while 1:
        yield Selector(*parse_selector(stream))
        if stream.peek() == ('DELIM', ','):
            stream.next()
            stream.skip_whitespace()
        else:
            break
 def parse_selector(stream):
    result, pseudo_element = parse_simple_selector(stream)
    while 1:
        stream.skip_whitespace()
        peek = stream.peek()
        if peek in (('EOF', None), ('DELIM', ',')):
            break
        if pseudo_element:
            raise SelectorSyntaxError(
                'Got pseudo-element ::%s not at the end of a selector'
                % pseudo_element)
        if peek.is_delim('+', '>', '~'):
            # A combinator
            combinator = stream.next().value
            stream.skip_whitespace()
        else:
            # By exclusion, the last parse_simple_selector() ended
            # at peek == ' '
            combinator = ' '
        next_selector, pseudo_element = parse_simple_selector(stream)
        result = CombinedSelector(result, combinator, next_selector)
    return result, pseudo_element
 special_pseudo_elements = (
    'first-line', 'first-letter', 'before', 'after')
 def parse_simple_selector(stream, inside_negation=False):
    stream.skip_whitespace()
    selector_start = len(stream.used)
    peek = stream.peek()
    if peek.type == 'IDENT' or peek == ('DELIM', '*'):
        if peek.type == 'IDENT':
            namespace = stream.next().value
        else:
            stream.next()
            namespace = None
        if stream.peek() == ('DELIM', '|'):
            stream.next()
            element = stream.next_ident_or_star()
        else:
            element = namespace
            namespace = None
    else:
        element = namespace = None
    result = Element(namespace, element)
    pseudo_element = None
    while 1:
        peek = stream.peek()
        if peek.type in ('S', 'EOF') or peek.is_delim(',', '+', '>', '~') or (
                inside_negation and peek == ('DELIM', ')')):
            break
        if pseudo_element:
            raise SelectorSyntaxError(
                'Got pseudo-element ::%s not at the end of a selector'
                % pseudo_element)
        if peek.type == 'HASH':
            result = Hash(result, stream.next().value)
        elif peek == ('DELIM', '.'):
            stream.next()
            result = Class(result, stream.next_ident())
        elif peek == ('DELIM', '['):
            stream.next()
            result = parse_attrib(result, stream)
        elif peek == ('DELIM', ':'):
            stream.next()
            if stream.peek() == ('DELIM', ':'):
                stream.next()
                pseudo_element = stream.next_ident()
                if stream.peek() == ('DELIM', '('):
                    stream.next()
                    pseudo_element = FunctionalPseudoElement(
                        pseudo_element, parse_arguments(stream))
                continue
            ident = stream.next_ident()
            if ident.lower() in special_pseudo_elements:
                # Special case: CSS 2.1 pseudo-elements can have a single ':'
                # Any new pseudo-element must have two.
                pseudo_element = unicode_type(ident)
                continue
            if stream.peek() != ('DELIM', '('):
                result = Pseudo(result, ident)
                continue
            stream.next()
            stream.skip_whitespace()
            if ident.lower() == 'not':
                if inside_negation:
                    raise SelectorSyntaxError('Got nested :not()')
                argument, argument_pseudo_element = parse_simple_selector(
                    stream, inside_negation=True)
                next = stream.next()
                if argument_pseudo_element:
                    raise SelectorSyntaxError(
                        'Got pseudo-element ::%s inside :not() at %s'
                        % (argument_pseudo_element, next.pos))
                if next != ('DELIM', ')'):
                    raise SelectorSyntaxError("Expected ')', got %s" % (next,))
                result = Negation(result, argument)
            else:
                result = Function(result, ident, parse_arguments(stream))
        else:
            raise SelectorSyntaxError(
                "Expected selector, got %s" % (peek,))
    if len(stream.used) == selector_start:
        raise SelectorSyntaxError(
            "Expected selector, got %s" % (stream.peek(),))
    return result, pseudo_element
 def parse_arguments(stream):
    arguments = []
    while 1:
        stream.skip_whitespace()
        next = stream.next()
        if next.type in ('IDENT', 'STRING', 'NUMBER') or next in [
                ('DELIM', '+'), ('DELIM', '-')]:
            arguments.append(next)
        elif next == ('DELIM', ')'):
            return arguments
        else:
            raise SelectorSyntaxError(
                "Expected an argument, got %s" % (next,))
 def parse_attrib(selector, stream):
    stream.skip_whitespace()
    attrib = stream.next_ident_or_star()
    if attrib is None and stream.peek() != ('DELIM', '|'):
        raise SelectorSyntaxError(
            "Expected '|', got %s" % (stream.peek(),))
    if stream.peek() == ('DELIM', '|'):
        stream.next()
        if stream.peek() == ('DELIM', '='):
            namespace = None
            stream.next()
            op = '|='
        else:
            namespace = attrib
            attrib = stream.next_ident()
            op = None
    else:
        namespace = op = None
    if op is None:
        stream.skip_whitespace()
        next = stream.next()
        if next == ('DELIM', ']'):
            return Attrib(selector, namespace, attrib, 'exists', None)
        elif next == ('DELIM', '='):
            op = '='
        elif next.is_delim('^', '$', '*', '~', '|', '!') and (
                stream.peek() == ('DELIM', '=')):
            op = next.value + '='
            stream.next()
        else:
            raise SelectorSyntaxError(
                "Operator expected, got %s" % (next,))
    stream.skip_whitespace()
    value = stream.next()
    if value.type not in ('IDENT', 'STRING'):
        raise SelectorSyntaxError(
            "Expected string or ident, got %s" % (value,))
    stream.skip_whitespace()
    next = stream.next()
    if next != ('DELIM', ']'):
        raise SelectorSyntaxError(
            "Expected ']', got %s" % (next,))
    return Attrib(selector, namespace, attrib, op, value.value)
 def parse_series(tokens):
    """
    Parses the arguments for :nth-child() and friends.
    :param tokens: A list of tokens
    :raises: ValueError if the tokens do not form a valid series
    :returns: ``(a, b)``
    """
    for token in tokens:
        if token.type == 'STRING':
            raise ValueError('String tokens not allowed in series.')
    s = ''.join(token.value for token in tokens).strip()
    if s == 'odd':
        return (2, 1)
    elif s == 'even':
        return (2, 0)
    elif s == 'n':
        return (1, 0)
    if 'n' not in s:
        # Just b
        return (0, int(s))
    a, b = s.split('n', 1)
    if not a:
        a = 1
    elif a == '-' or a == '+':
        a = int(a+'1')
    else:
        a = int(a)
    if not b:
        b = 0
    else:
        b = int(b)
    return (a, b)
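Review note: `parse_series()` reduces an `:nth-child()` style argument to the pair `(a, b)` of the expression `an+b`. The sketch below applies the same rules to a plain string instead of a token list. The function name `parse_nth` and the stripping of spaces are assumptions made for this example.

```python
def parse_nth(expr):
    """Turn an 'an+b' expression such as '2n+1' into the tuple (a, b)."""
    s = expr.replace(' ', '')
    if s == 'odd':
        return (2, 1)
    if s == 'even':
        return (2, 0)
    if 'n' not in s:
        # Only the constant part is present
        return (0, int(s))
    a, b = s.split('n', 1)
    if a in ('', '+'):
        a = 1
    elif a == '-':
        a = -1
    else:
        a = int(a)
    b = int(b) if b else 0
    return (a, b)
```

As in the original, malformed input surfaces as a `ValueError` from `int()`, which the caller can translate into a selector-specific error.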
 # Token objects
 class Token(tuple):
    def __new__(cls, type_, value, pos):
        obj = tuple.__new__(cls, (type_, value))
        obj.pos = pos
        return obj
    def __repr__(self):
        return "<%s '%s' at %i>" % (self.type, self.value, self.pos)
    def is_delim(self, *values):
        return self.type == 'DELIM' and self.value in values
    type = property(operator.itemgetter(0))
    value = property(operator.itemgetter(1))
 class EOFToken(Token):
    def __new__(cls, pos):
        return Token.__new__(cls, 'EOF', None, pos)
    def __repr__(self):
        return '<%s at %i>' % (self.type, self.pos)
 # Tokenizer
 class TokenMacros:
    unicode_escape = r'\\([0-9a-f]{1,6})(?:\r\n|[ \n\r\t\f])?'
    escape = unicode_escape + r'|\\[^\n\r\f0-9a-f]'
    string_escape = r'\\(?:\n|\r\n|\r|\f)|' + escape
    nonascii = r'[^\0-\177]'
    nmchar = '[_a-z0-9-]|%s|%s' % (escape, nonascii)
    nmstart = '[_a-z]|%s|%s' % (escape, nonascii)
 def _compile(pattern):
    return re.compile(pattern % vars(TokenMacros), re.IGNORECASE).match
 _match_whitespace = _compile(r'[ \t\r\n\f]+')
 _match_number = _compile(r'[+-]?(?:[0-9]*\.[0-9]+|[0-9]+)')
 _match_hash = _compile('#(?:%(nmchar)s)+')
 _match_ident = _compile('-?(?:%(nmstart)s)(?:%(nmchar)s)*')
 _match_string_by_quote = {
    "'": _compile(r"([^\n\r\f\\']|%(string_escape)s)*"),
    '"': _compile(r'([^\n\r\f\\"]|%(string_escape)s)*'),
 }
 _sub_simple_escape = re.compile(r'\\(.)').sub
 _sub_unicode_escape = re.compile(TokenMacros.unicode_escape, re.I).sub
 _sub_newline_escape = re.compile(r'\\(?:\n|\r\n|\r|\f)').sub
 # Same as r'\1', but faster on CPython
 if hasattr(operator, 'methodcaller'):
    # Python 2.6+
    _replace_simple = operator.methodcaller('group', 1)
 else:
    def _replace_simple(match):
        return match.group(1)
 def _replace_unicode(match):
    codepoint = int(match.group(1), 16)
    if codepoint > sys.maxunicode:
        codepoint = 0xFFFD
    return codepoint_to_chr(codepoint)
 def unescape_ident(value):
    value = _sub_unicode_escape(_replace_unicode, value)
    value = _sub_simple_escape(_replace_simple, value)
    return value
 def tokenize(s):
    pos = 0
    len_s = len(s)
    while pos < len_s:
        match = _match_whitespace(s, pos=pos)
        if match:
            yield Token('S', ' ', pos)
            pos = match.end()
            continue
        match = _match_ident(s, pos=pos)
        if match:
            value = _sub_simple_escape(_replace_simple,
                    _sub_unicode_escape(_replace_unicode, match.group()))
            yield Token('IDENT', value, pos)
            pos = match.end()
            continue
        match = _match_hash(s, pos=pos)
        if match:
            value = _sub_simple_escape(_replace_simple,
                    _sub_unicode_escape(_replace_unicode, match.group()[1:]))
            yield Token('HASH', value, pos)
            pos = match.end()
            continue
        quote = s[pos]
        if quote in _match_string_by_quote:
            match = _match_string_by_quote[quote](s, pos=pos + 1)
            assert match, 'Should have found at least an empty match'
            end_pos = match.end()
            if end_pos == len_s:
                raise SelectorSyntaxError('Unclosed string at %s' % pos)
            if s[end_pos] != quote:
                raise SelectorSyntaxError('Invalid string at %s' % pos)
            value = _sub_simple_escape(_replace_simple,
                    _sub_unicode_escape(_replace_unicode,
                    _sub_newline_escape('', match.group())))
            yield Token('STRING', value, pos)
            pos = end_pos + 1
            continue
        match = _match_number(s, pos=pos)
        if match:
            value = match.group()
            yield Token('NUMBER', value, pos)
            pos = match.end()
            continue
        pos2 = pos + 2
        if s[pos:pos2] == '/*':
            pos = s.find('*/', pos2)
            if pos == -1:
                pos = len_s
            else:
                pos += 2
            continue
        yield Token('DELIM', s[pos], pos)
        pos += 1
    assert pos == len_s
    yield EOFToken(pos)
 class TokenStream(object):
    def __init__(self, tokens, source=None):
        self.used = []
        self.tokens = iter(tokens)
        self.source = source
        self.peeked = None
        self._peeking = False
        try:
            self.next_token = self.tokens.next
        except AttributeError:
            # Python 3
            self.next_token = self.tokens.__next__
    def next(self):
        if self._peeking:
            self._peeking = False
            self.used.append(self.peeked)
            return self.peeked
        else:
            next = self.next_token()
            self.used.append(next)
            return next
    def peek(self):
        if not self._peeking:
            self.peeked = self.next_token()
            self._peeking = True
        return self.peeked
    def next_ident(self):
        next = self.next()
        if next.type != 'IDENT':
            raise SelectorSyntaxError('Expected ident, got %s' % (next,))
        return next.value
    def next_ident_or_star(self):
        next = self.next()
        if next.type == 'IDENT':
            return next.value
        elif next == ('DELIM', '*'):
            return None
        else:
            raise SelectorSyntaxError(
                "Expected ident or '*', got %s" % (next,))
    def skip_whitespace(self):
        peek = self.peek()
        if peek.type == 'S':
            self.next()
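
Note on the `TokenStream` class above: its `peek()`/`next()` pair implements a one-token lookahead. The following is a minimal, self-contained sketch of that pattern, not the module's own code. `PeekableStream` is a hypothetical stand-in name, and plain strings take the place of `Token` objects.

```python
class PeekableStream:
    """Hypothetical stand-in for TokenStream: one-item lookahead over an iterable."""

    def __init__(self, tokens):
        self._tokens = iter(tokens)
        self._peeked = None
        self._peeking = False
        self.used = []  # every item handed out by next(), in order

    def next(self):
        # Hand out the buffered item if peek() already fetched it.
        if self._peeking:
            self._peeking = False
            token = self._peeked
        else:
            token = next(self._tokens)
        self.used.append(token)
        return token

    def peek(self):
        # Fetch at most one item ahead; repeated peeks return the same item.
        if not self._peeking:
            self._peeked = next(self._tokens)
            self._peeking = True
        return self._peeked


stream = PeekableStream(['a', 'b', 'c'])
first_peek = stream.peek()
second_peek = stream.peek()
consumed = [stream.next(), stream.next()]
```

Peeking does not record anything in `used`; only `next()` does, which mirrors how the class above tracks consumed tokens.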
--- a/ebook_converter/css_selectors/select.py
+++ b/ebook_converter/css_selectors/select.py
@@ -0,0 +1,694 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
 import re, itertools
 from collections import OrderedDict, defaultdict
 from functools import wraps
 from itertools import chain
 from lxml import etree
 from css_selectors.errors import ExpressionError
 from css_selectors.parser import parse, ascii_lower, Element, FunctionalPseudoElement
 from css_selectors.ordered_set import OrderedSet
 from polyglot.builtins import iteritems, itervalues
 PARSE_CACHE_SIZE = 200
 parse_cache = OrderedDict()
 XPATH_CACHE_SIZE = 30
 xpath_cache = OrderedDict()
 # Test that the string is not empty and does not contain whitespace
 is_non_whitespace = re.compile(r'^[^ \t\r\n\f]+$').match
 def get_parsed_selector(raw):
    try:
        return parse_cache[raw]
    except KeyError:
        parse_cache[raw] = ans = parse(raw)
        if len(parse_cache) > PARSE_CACHE_SIZE:
            parse_cache.pop(next(iter(parse_cache)))
        return ans
 def get_compiled_xpath(expr):
    try:
        return xpath_cache[expr]
    except KeyError:
        xpath_cache[expr] = ans = etree.XPath(expr)
        if len(xpath_cache) > XPATH_CACHE_SIZE:
            xpath_cache.pop(next(iter(xpath_cache)))
        return ans
 class AlwaysIn(object):
    def __contains__(self, x):
        return True
 always_in = AlwaysIn()
 def trace_wrapper(func):
    @wraps(func)
    def trace(*args, **kwargs):
        targs = args[1:] if args and isinstance(args[0], Select) else args
        print('Called:', func.__name__, 'with args:', targs, kwargs or '')
        return func(*args, **kwargs)
    return trace
 def normalize_language_tag(tag):
    """Return a set of normalized combinations for a `BCP 47` language tag.
    Example:
    >>> sorted(normalize_language_tag('de_AT-1901'))
    ['de', 'de-1901', 'de-at', 'de-at-1901']
    """
    # normalize:
    tag = ascii_lower(tag).replace('_','-')
    # split (except singletons, which mark the following tag as non-standard):
    tag = re.sub(r'-([a-zA-Z0-9])-', r'-\1_', tag)
    subtags = [subtag.replace('_', '-') for subtag in tag.split('-')]
    base_tag = (subtags.pop(0),)
    taglist = {base_tag[0]}
    # find all combinations of subtags
    for n in range(len(subtags), 0, -1):
        for tags in itertools.combinations(subtags, n):
            taglist.add('-'.join(base_tag + tags))
    return taglist
 INAPPROPRIATE_PSEUDO_CLASSES = frozenset((
    'active', 'after', 'disabled', 'visited', 'link', 'before', 'focus', 'first-letter', 'enabled', 'first-line', 'hover', 'checked', 'target'))
 class Select(object):
    '''
    This class implements CSS Level 3 selectors
    (http://www.w3.org/TR/css3-selectors) on an lxml tree, with caching for
    performance. To use:
    >>> from css_selectors import Select
    >>> select = Select(root)  # Where root is an lxml document
    >>> print(tuple(select('p.myclass')))
    Tags are returned in document order. Note that attribute and tag names are
    matched case-insensitively. Class and id values are also matched
    case-insensitively. Also namespaces are ignored (this is for performance of
    the common case).  The UI related selectors are not implemented, such as
    :enabled, :disabled, :checked, :hover, etc.  Similarly, the non-element
    related selectors such as ::first-line, ::first-letter, ::before, etc. are
    not implemented.
    WARNING: This class uses internal caches. You *must not* make any changes
    to the lxml tree. If you do make some changes, either create a new Select
    object or call :meth:`invalidate_caches`.
    This class can be easily sub-classed to work with tree implementations
    other than lxml. Simply override the methods in the ``Tree Integration``
    block below.
    The caching works by maintaining internal maps from classes/ids/tag
    names/etc.  to node sets. These caches are populated as needed, and used
    for all subsequent selections.  Thus, for best performance you should use
    the same selector object for finding the matching nodes for multiple
    queries. Of course, remember not to change the tree in between queries.
    '''
    combinator_mapping = {
        ' ': 'descendant',
        '>': 'child',
        '+': 'direct_adjacent',
        '~': 'indirect_adjacent',
    }
    attribute_operator_mapping = {
        'exists': 'exists',
        '=': 'equals',
        '~=': 'includes',
        '|=': 'dashmatch',
        '^=': 'prefixmatch',
        '$=': 'suffixmatch',
        '*=': 'substringmatch',
    }
    def __init__(self, root, default_lang=None, ignore_inappropriate_pseudo_classes=False, dispatch_map=None, trace=False):
        if hasattr(root, 'getroot'):
            root = root.getroot()
        self.root = root
        self.dispatch_map = dispatch_map or default_dispatch_map
        self.invalidate_caches()
        self.default_lang = default_lang
        if trace:
            self.dispatch_map = {k:trace_wrapper(v) for k, v in iteritems(self.dispatch_map)}
        if ignore_inappropriate_pseudo_classes:
            self.ignore_inappropriate_pseudo_classes = INAPPROPRIATE_PSEUDO_CLASSES
        else:
            self.ignore_inappropriate_pseudo_classes = frozenset()
    # External API {{{
    def invalidate_caches(self):
        'Invalidate all caches. You must call this before using this object if you have made changes to the HTML tree'
        self._element_map = None
        self._id_map = None
        self._class_map = None
        self._attrib_map = None
        self._attrib_space_map = None
        self._lang_map = None
        self.map_tag_name = ascii_lower
        if '{' in self.root.tag:
            def map_tag_name(x):
                return ascii_lower(x.rpartition('}')[2])
            self.map_tag_name = map_tag_name
    def __call__(self, selector, root=None):
        ''' Return an iterator over all matching tags, in document order.
        Normally, all matching tags in the document are returned. If you
        specify root, then only tags that are root or descendants of root are
        returned. Note that this can be very expensive if root has a lot of
        descendants. '''
        seen = set()
        if root is not None:
            root = frozenset(self.itertag(root))
        for parsed_selector in get_parsed_selector(selector):
            for item in self.iterparsedselector(parsed_selector):
                if item not in seen and (root is None or item in root):
                    yield item
                    seen.add(item)
    def has_matches(self, selector, root=None):
        'Return True iff selector matches at least one item in the tree'
        for elem in self(selector, root=root):
            return True
        return False
    # }}}
    def iterparsedselector(self, parsed_selector):
        type_name = type(parsed_selector).__name__
        try:
            func = self.dispatch_map[ascii_lower(type_name)]
        except KeyError:
            raise ExpressionError('%s is not supported' % type_name)
        for item in func(self, parsed_selector):
            yield item
    @property
    def element_map(self):
        if self._element_map is None:
            self._element_map = em = defaultdict(OrderedSet)
            for tag in self.itertag():
                em[self.map_tag_name(tag.tag)].add(tag)
        return self._element_map
    @property
    def id_map(self):
        if self._id_map is None:
            self._id_map = im = defaultdict(OrderedSet)
            lower = ascii_lower
            for elem in self.iteridtags():
                im[lower(elem.get('id'))].add(elem)
        return self._id_map
    @property
    def class_map(self):
        if self._class_map is None:
            self._class_map = cm = defaultdict(OrderedSet)
            lower = ascii_lower
            for elem in self.iterclasstags():
                for cls in elem.get('class').split():
                    cm[lower(cls)].add(elem)
        return self._class_map
    @property
    def attrib_map(self):
        if self._attrib_map is None:
            self._attrib_map = am = defaultdict(lambda : defaultdict(OrderedSet))
            map_attrib_name = ascii_lower
            if '{' in self.root.tag:
                def map_attrib_name(x):
                    return ascii_lower(x.rpartition('}')[2])
            for tag in self.itertag():
                for attr, val in iteritems(tag.attrib):
                    am[map_attrib_name(attr)][val].add(tag)
        return self._attrib_map
    @property
    def attrib_space_map(self):
        if self._attrib_space_map is None:
            self._attrib_space_map = am = defaultdict(lambda : defaultdict(OrderedSet))
            map_attrib_name = ascii_lower
            if '{' in self.root.tag:
                def map_attrib_name(x):
                    return ascii_lower(x.rpartition('}')[2])
            for tag in self.itertag():
                for attr, val in iteritems(tag.attrib):
                    for v in val.split():
                        am[map_attrib_name(attr)][v].add(tag)
        return self._attrib_space_map
    @property
    def lang_map(self):
        if self._lang_map is None:
            self._lang_map = lm = defaultdict(OrderedSet)
            dl = normalize_language_tag(self.default_lang) if self.default_lang else None
            lmap = {tag:dl for tag in self.itertag()} if dl else {}
            for tag in self.itertag():
                lang = None
                for attr in ('{http://www.w3.org/XML/1998/namespace}lang', 'lang'):
                    lang = lang or tag.get(attr)
                if lang:
                    lang = normalize_language_tag(lang)
                    for dtag in self.itertag(tag):
                        lmap[dtag] = lang
            for tag, langs in iteritems(lmap):
                for lang in langs:
                    lm[lang].add(tag)
        return self._lang_map
    # Tree Integration {{{
    def itertag(self, tag=None):
        return (self.root if tag is None else tag).iter('*')
    def iterdescendants(self, tag=None):
        return (self.root if tag is None else tag).iterdescendants('*')
    def iterchildren(self, tag=None):
        return (self.root if tag is None else tag).iterchildren('*')
    def itersiblings(self, tag=None, preceding=False):
        return (self.root if tag is None else tag).itersiblings('*', preceding=preceding)
    def iteridtags(self):
        return get_compiled_xpath('//*[@id]')(self.root)
    def iterclasstags(self):
        return get_compiled_xpath('//*[@class]')(self.root)
    def sibling_count(self, child, before=True, same_type=False):
        ' Return the number of siblings before or after child or raise ValueError if child has no parent. '
        parent = child.getparent()
        if parent is None:
            raise ValueError('Child has no parent')
        if same_type:
            siblings = OrderedSet(child.itersiblings(preceding=before))
            return len(self.element_map[self.map_tag_name(child.tag)] & siblings)
        else:
            if before:
                return parent.index(child)
            return len(parent) - parent.index(child) - 1
    def all_sibling_count(self, child, same_type=False):
        ' Return the number of siblings of child or raise ValueError if child has no parent '
        parent = child.getparent()
        if parent is None:
            raise ValueError('Child has no parent')
        if same_type:
            siblings = OrderedSet(chain(child.itersiblings(preceding=False), child.itersiblings(preceding=True)))
            return len(self.element_map[self.map_tag_name(child.tag)] & siblings)
        else:
            return len(parent) - 1
    def is_empty(self, elem):
        ' Return True iff elem has no child tags and no text content '
        for child in elem:
            # Check for comment/PI nodes with tail text
            if child.tail:
                return False
        return len(tuple(elem.iterchildren('*'))) == 0 and not elem.text
    # }}}
 # Combinators {{{
 def select_combinedselector(cache, combined):
    """Translate a combined selector."""
    combinator = cache.combinator_mapping[combined.combinator]
    # Fast path for when the sub-selector is all elements
    right = None if isinstance(combined.subselector, Element) and (
        combined.subselector.element or '*') == '*' else cache.iterparsedselector(combined.subselector)
    for item in cache.dispatch_map[combinator](cache, cache.iterparsedselector(combined.selector), right):
        yield item
 def select_descendant(cache, left, right):
    """right is a child, grand-child or further descendant of left"""
    right = always_in if right is None else frozenset(right)
    for ancestor in left:
        for descendant in cache.iterdescendants(ancestor):
            if descendant in right:
                yield descendant
 def select_child(cache, left, right):
    """right is an immediate child of left"""
    right = always_in if right is None else frozenset(right)
    for parent in left:
        for child in cache.iterchildren(parent):
            if child in right:
                yield child
 def select_direct_adjacent(cache, left, right):
    """right is a sibling immediately after left"""
    right = always_in if right is None else frozenset(right)
    for parent in left:
        for sibling in cache.itersiblings(parent):
            if sibling in right:
                yield sibling
            break
 def select_indirect_adjacent(cache, left, right):
    """right is a sibling after left, immediately or not"""
    right = always_in if right is None else frozenset(right)
    for parent in left:
        for sibling in cache.itersiblings(parent):
            if sibling in right:
                yield sibling
 # }}}
 def select_element(cache, selector):
    """A type or universal selector."""
    element = selector.element
    if not element or element == '*':
        for elem in cache.itertag():
            yield elem
    else:
        for elem in cache.element_map[ascii_lower(element)]:
            yield elem
 def select_hash(cache, selector):
    'An id selector'
    items = cache.id_map[ascii_lower(selector.id)]
    if len(items) > 0:
        for elem in cache.iterparsedselector(selector.selector):
            if elem in items:
                yield elem
 def select_class(cache, selector):
    'A class selector'
    items = cache.class_map[ascii_lower(selector.class_name)]
    if items:
        for elem in cache.iterparsedselector(selector.selector):
            if elem in items:
                yield elem
 def select_negation(cache, selector):
    'Implement :not()'
    exclude = frozenset(cache.iterparsedselector(selector.subselector))
    for item in cache.iterparsedselector(selector.selector):
        if item not in exclude:
            yield item
 # Attribute selectors {{{
 def select_attrib(cache, selector):
    operator = cache.attribute_operator_mapping[selector.operator]
    items = frozenset(cache.dispatch_map[operator](cache, ascii_lower(selector.attrib), selector.value))
    for item in cache.iterparsedselector(selector.selector):
        if item in items:
            yield item
 def select_exists(cache, attrib, value=None):
    for elem_set in itervalues(cache.attrib_map[attrib]):
        for elem in elem_set:
            yield elem
 def select_equals(cache, attrib, value):
    for elem in cache.attrib_map[attrib][value]:
        yield elem
 def select_includes(cache, attrib, value):
    if is_non_whitespace(value):
        for elem in cache.attrib_space_map[attrib][value]:
            yield elem
 def select_dashmatch(cache, attrib, value):
    if value:
        for val, elem_set in iteritems(cache.attrib_map[attrib]):
            if val == value or val.startswith(value + '-'):
                for elem in elem_set:
                    yield elem
 def select_prefixmatch(cache, attrib, value):
    if value:
        for val, elem_set in iteritems(cache.attrib_map[attrib]):
            if val.startswith(value):
                for elem in elem_set:
                    yield elem
 def select_suffixmatch(cache, attrib, value):
    if value:
        for val, elem_set in iteritems(cache.attrib_map[attrib]):
            if val.endswith(value):
                for elem in elem_set:
                    yield elem
 def select_substringmatch(cache, attrib, value):
    if value:
        for val, elem_set in iteritems(cache.attrib_map[attrib]):
            if value in val:
                for elem in elem_set:
                    yield elem
 # }}}
 # Function selectors {{{
 def select_function(cache, function):
    """Select with a functional pseudo-class."""
    fname = function.name.replace('-', '_')
    try:
        func = cache.dispatch_map[fname]
    except KeyError:
        raise ExpressionError(
            "The pseudo-class :%s() is unknown" % function.name)
    if fname == 'lang':
        items = frozenset(func(cache, function))
        for item in cache.iterparsedselector(function.selector):
            if item in items:
                yield item
    else:
        for item in cache.iterparsedselector(function.selector):
            if func(cache, function, item):
                yield item
 def select_lang(cache, function):
    ' Implement :lang() '
    if function.argument_types() not in (['STRING'], ['IDENT']):
        raise ExpressionError("Expected a single string or ident for :lang(), got %r" % function.arguments)
    lang = function.arguments[0].value
    if lang:
        lang = ascii_lower(lang)
        lp = lang + '-'
        for tlang, elem_set in iteritems(cache.lang_map):
            if tlang == lang or (tlang is not None and tlang.startswith(lp)):
                for elem in elem_set:
                    yield elem
 def select_nth_child(cache, function, elem):
    ' Implement :nth-child() '
    a, b = function.parsed_arguments
    try:
        num = cache.sibling_count(elem) + 1
    except ValueError:
        return False
    if a == 0:
        return num == b
    n = (num - b) / a
    return n.is_integer() and n > -1
 def select_nth_last_child(cache, function, elem):
    ' Implement :nth-last-child() '
    a, b = function.parsed_arguments
    try:
        num = cache.sibling_count(elem, before=False) + 1
    except ValueError:
        return False
    if a == 0:
        return num == b
    n = (num - b) / a
    return n.is_integer() and n > -1
 def select_nth_of_type(cache, function, elem):
    ' Implement :nth-of-type() '
    a, b = function.parsed_arguments
    try:
        num = cache.sibling_count(elem, same_type=True) + 1
    except ValueError:
        return False
    if a == 0:
        return num == b
    n = (num - b) / a
    return n.is_integer() and n > -1
 def select_nth_last_of_type(cache, function, elem):
    ' Implement :nth-last-of-type() '
    a, b = function.parsed_arguments
    try:
        num = cache.sibling_count(elem, before=False, same_type=True) + 1
    except ValueError:
        return False
    if a == 0:
        return num == b
    n = (num - b) / a
    return n.is_integer() and n > -1
 # }}}
 # Pseudo elements {{{
 def pseudo_func(f):
    f.is_pseudo = True
    return f
@pseudo_func
 def allow_all(cache, item):
    return True
 def get_func_for_pseudo(cache, ident):
    try:
        func = cache.dispatch_map[ident.replace('-', '_')]
    except KeyError:
        if ident in cache.ignore_inappropriate_pseudo_classes:
            func = allow_all
        else:
            raise ExpressionError(
                "The pseudo-class :%s is not supported" % ident)
    try:
        func.is_pseudo
    except AttributeError:
        raise ExpressionError(
            "The pseudo-class :%s is invalid" % ident)
    return func
 def select_selector(cache, selector):
    if selector.pseudo_element is None:
        for item in cache.iterparsedselector(selector.parsed_tree):
            yield item
        return
    if isinstance(selector.pseudo_element, FunctionalPseudoElement):
        raise ExpressionError(
            "The pseudo-element ::%s is not supported" % selector.pseudo_element.name)
    func = get_func_for_pseudo(cache, selector.pseudo_element)
    for item in cache.iterparsedselector(selector.parsed_tree):
        if func(cache, item):
            yield item
 def select_pseudo(cache, pseudo):
    func = get_func_for_pseudo(cache, pseudo.ident)
    if func is select_root:
        yield cache.root
        return
    for item in cache.iterparsedselector(pseudo.selector):
        if func(cache, item):
            yield item
@pseudo_func
 def select_root(cache, elem):
    return elem is cache.root
@pseudo_func
 def select_first_child(cache, elem):
    try:
        return cache.sibling_count(elem) == 0
    except ValueError:
        return False
@pseudo_func
 def select_last_child(cache, elem):
    try:
        return cache.sibling_count(elem, before=False) == 0
    except ValueError:
        return False
@pseudo_func
 def select_only_child(cache, elem):
    try:
        return cache.all_sibling_count(elem) == 0
    except ValueError:
        return False
@pseudo_func
 def select_first_of_type(cache, elem):
    try:
        return cache.sibling_count(elem, same_type=True) == 0
    except ValueError:
        return False
@pseudo_func
 def select_last_of_type(cache, elem):
    try:
        return cache.sibling_count(elem, before=False, same_type=True) == 0
    except ValueError:
        return False
@pseudo_func
 def select_only_of_type(cache, elem):
    try:
        return cache.all_sibling_count(elem, same_type=True) == 0
    except ValueError:
        return False
@pseudo_func
 def select_empty(cache, elem):
    return cache.is_empty(elem)
 # }}}
 default_dispatch_map = {name.partition('_')[2]:obj for name, obj in globals().items() if name.startswith('select_') and callable(obj)}
 if __name__ == '__main__':
    from pprint import pprint
    root = etree.fromstring(
            '<body xmlns="xxx" xml:lang="en"><p id="p" class="one two" lang="fr"><a id="a"/><b/><c/><d/></p></body>',
            parser=etree.XMLParser(recover=True, no_network=True, resolve_entities=False))
    select = Select(root, ignore_inappropriate_pseudo_classes=True, trace=True)
    pprint(list(select('p:disabled')))
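
Note on the `select_nth_*` functions above: they all share the same arithmetic. A 1-based position matches `an+b` when `(position - b) / a` is a non-negative integer, or when `a` is zero and the position equals `b`. The sketch below isolates that check under Python 3 true division. `matches_nth` is a hypothetical helper name and is not part of the module.

```python
def matches_nth(position, a, b):
    """Return True if the 1-based position equals a*n + b for some integer n >= 0."""
    if a == 0:
        return position == b
    n = (position - b) / a  # true division, so n is a float
    return n.is_integer() and n > -1


# 2n+1 selects the odd positions among seven siblings.
odd = [p for p in range(1, 8) if matches_nth(p, 2, 1)]

# -n+3 selects the first three positions.
first_three = [p for p in range(1, 8) if matches_nth(p, -1, 3)]

# a == 0 degenerates to an exact position match.
only_fourth = [p for p in range(1, 8) if matches_nth(p, 0, 4)]
```

In the module itself the position comes from `sibling_count(...) + 1`, and the `(a, b)` pair from `function.parsed_arguments`.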
--- a/ebook_converter/css_selectors/tests.py
+++ b/ebook_converter/css_selectors/tests.py
@@ -0,0 +1,843 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2015, Kovid Goyal <kovid at kovidgoyal.net>'
 import unittest, sys, argparse
 from lxml import etree, html
 from css_selectors.errors import SelectorSyntaxError, ExpressionError
 from css_selectors.parser import tokenize, parse
 from css_selectors.select import Select
 class TestCSSSelectors(unittest.TestCase):
    # Test data {{{
    HTML_IDS = '''
 <html id="html"><head>
  <link id="link-href" href="foo" />
  <link id="link-nohref" />
 </head><body>
 <div id="outer-div">
 <a id="name-anchor" name="foo"></a>
 <a id="tag-anchor" rel="tag" href="http://localhost/foo">link</a>
 <a id="nofollow-anchor" rel="nofollow" href="https://example.org">
    link</a>
 <ol id="first-ol" class="a b c">
   <li id="first-li">content</li>
   <li id="second-li" lang="En-us">
     <div id="li-div">
     </div>
   </li>
   <li id="third-li" class="ab c"></li>
   <li id="fourth-li" class="ab
 c"></li>
   <li id="fifth-li"></li>
   <li id="sixth-li"></li>
   <li id="seventh-li">  </li>
 </ol>
 <p id="paragraph">
   <b id="p-b">hi</b> <em id="p-em">there</em>
   <b id="p-b2">guy</b>
   <input type="checkbox" id="checkbox-unchecked" />
   <input type="checkbox" id="checkbox-disabled" disabled="" />
   <input type="text" id="text-checked" checked="checked" />
   <input type="hidden" />
   <input type="hidden" disabled="disabled" />
   <input type="checkbox" id="checkbox-checked" checked="checked" />
   <input type="checkbox" id="checkbox-disabled-checked"
          disabled="disabled" checked="checked" />
   <fieldset id="fieldset" disabled="disabled">
     <input type="checkbox" id="checkbox-fieldset-disabled" />
     <input type="hidden" />
   </fieldset>
 </p>
 <ol id="second-ol">
 </ol>
 <map name="dummymap">
   <area shape="circle" coords="200,250,25" href="foo.html" id="area-href" />
   <area shape="default" id="area-nohref" />
 </map>
 </div>
 <div id="foobar-div" foobar="ab bc
 cde"><span id="foobar-span"></span></div>
 </body></html>
 '''
    HTML_SHAKESPEARE = '''
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" debug="true">
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
 </head>
 <body>
 <div id="test">
 <div class="dialog">
 <h2>As You Like It</h2>
 <div id="playwright">
 by William Shakespeare
 </div>
 <div class="dialog scene thirdClass" id="scene1">
 <h3>ACT I, SCENE III. A room in the palace.</h3>
 <div class="dialog">
 <div class="direction">Enter CELIA and ROSALIND</div>
 </div>
 <div id="speech1" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.1">Why, cousin! why, Rosalind! Cupid have mercy! not a word?</div>
 </div>
 <div id="speech2" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.2">Not one to throw at a dog.</div>
 </div>
 <div id="speech3" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.3">No, thy words are too precious to be cast away upon</div>
 <div id="scene1.3.4">curs; throw some of them at me; come, lame me with reasons.</div>
 </div>
 <div id="speech4" class="character">ROSALIND</div>
 <div id="speech5" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.8">But is all this for your father?</div>
 </div>
 <div class="dialog">
 <div id="scene1.3.5">Then there were two cousins laid up; when the one</div>
 <div id="scene1.3.6">should be lamed with reasons and the other mad</div>
 <div id="scene1.3.7">without any.</div>
 </div>
 <div id="speech6" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.9">No, some of it is for my child's father. O, how</div>
 <div id="scene1.3.10">full of briers is this working-day world!</div>
 </div>
 <div id="speech7" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.11">They are but burs, cousin, thrown upon thee in</div>
 <div id="scene1.3.12">holiday foolery: if we walk not in the trodden</div>
 <div id="scene1.3.13">paths our very petticoats will catch them.</div>
 </div>
 <div id="speech8" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.14">I could shake them off my coat: these burs are in my heart.</div>
 </div>
 <div id="speech9" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.15">Hem them away.</div>
 </div>
 <div id="speech10" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.16">I would try, if I could cry 'hem' and have him.</div>
 </div>
 <div id="speech11" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.17">Come, come, wrestle with thy affections.</div>
 </div>
 <div id="speech12" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.18">O, they take the part of a better wrestler than myself!</div>
 </div>
 <div id="speech13" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.19">O, a good wish upon you! you will try in time, in</div>
 <div id="scene1.3.20">despite of a fall. But, turning these jests out of</div>
 <div id="scene1.3.21">service, let us talk in good earnest: is it</div>
 <div id="scene1.3.22">possible, on such a sudden, you should fall into so</div>
 <div id="scene1.3.23">strong a liking with old Sir Rowland's youngest son?</div>
 </div>
 <div id="speech14" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.24">The duke my father loved his father dearly.</div>
 </div>
 <div id="speech15" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.25">Doth it therefore ensue that you should love his son</div>
 <div id="scene1.3.26">dearly? By this kind of chase, I should hate him,</div>
 <div id="scene1.3.27">for my father hated his father dearly; yet I hate</div>
 <div id="scene1.3.28">not Orlando.</div>
 </div>
 <div id="speech16" class="character">ROSALIND</div>
 <div title="wtf" class="dialog">
 <div id="scene1.3.29">No, faith, hate him not, for my sake.</div>
 </div>
 <div id="speech17" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.30">Why should I not? doth he not deserve well?</div>
 </div>
 <div id="speech18" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.31">Let me love him for that, and do you love him</div>
 <div id="scene1.3.32">because I do. Look, here comes the duke.</div>
 </div>
 <div id="speech19" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.33">With his eyes full of anger.</div>
 <div class="direction">Enter DUKE FREDERICK, with Lords</div>
 </div>
 <div id="speech20" class="character">DUKE FREDERICK</div>
 <div class="dialog">
 <div id="scene1.3.34">Mistress, dispatch you with your safest haste</div>
 <div id="scene1.3.35">And get you from our court.</div>
 </div>
 <div id="speech21" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.36">Me, uncle?</div>
 </div>
 <div id="speech22" class="character">DUKE FREDERICK</div>
 <div class="dialog">
 <div id="scene1.3.37">You, cousin</div>
 <div id="scene1.3.38">Within these ten days if that thou be'st found</div>
 <div id="scene1.3.39">So near our public court as twenty miles,</div>
 <div id="scene1.3.40">Thou diest for it.</div>
 </div>
 <div id="speech23" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.41">                  I do beseech your grace,</div>
 <div id="scene1.3.42">Let me the knowledge of my fault bear with me:</div>
 <div id="scene1.3.43">If with myself I hold intelligence</div>
 <div id="scene1.3.44">Or have acquaintance with mine own desires,</div>
 <div id="scene1.3.45">If that I do not dream or be not frantic,--</div>
 <div id="scene1.3.46">As I do trust I am not--then, dear uncle,</div>
 <div id="scene1.3.47">Never so much as in a thought unborn</div>
 <div id="scene1.3.48">Did I offend your highness.</div>
 </div>
 <div id="speech24" class="character">DUKE FREDERICK</div>
 <div class="dialog">
 <div id="scene1.3.49">Thus do all traitors:</div>
 <div id="scene1.3.50">If their purgation did consist in words,</div>
 <div id="scene1.3.51">They are as innocent as grace itself:</div>
 <div id="scene1.3.52">Let it suffice thee that I trust thee not.</div>
 </div>
 <div id="speech25" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.53">Yet your mistrust cannot make me a traitor:</div>
 <div id="scene1.3.54">Tell me whereon the likelihood depends.</div>
 </div>
 <div id="speech26" class="character">DUKE FREDERICK</div>
 <div class="dialog">
 <div id="scene1.3.55">Thou art thy father's daughter; there's enough.</div>
 </div>
 <div id="speech27" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.56">So was I when your highness took his dukedom;</div>
 <div id="scene1.3.57">So was I when your highness banish'd him:</div>
 <div id="scene1.3.58">Treason is not inherited, my lord;</div>
 <div id="scene1.3.59">Or, if we did derive it from our friends,</div>
 <div id="scene1.3.60">What's that to me? my father was no traitor:</div>
 <div id="scene1.3.61">Then, good my liege, mistake me not so much</div>
 <div id="scene1.3.62">To think my poverty is treacherous.</div>
 </div>
 <div id="speech28" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.63">Dear sovereign, hear me speak.</div>
 </div>
 <div id="speech29" class="character">DUKE FREDERICK</div>
 <div class="dialog">
 <div id="scene1.3.64">Ay, Celia; we stay'd her for your sake,</div>
 <div id="scene1.3.65">Else had she with her father ranged along.</div>
 </div>
 <div id="speech30" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.66">I did not then entreat to have her stay;</div>
 <div id="scene1.3.67">It was your pleasure and your own remorse:</div>
 <div id="scene1.3.68">I was too young that time to value her;</div>
 <div id="scene1.3.69">But now I know her: if she be a traitor,</div>
 <div id="scene1.3.70">Why so am I; we still have slept together,</div>
 <div id="scene1.3.71">Rose at an instant, learn'd, play'd, eat together,</div>
 <div id="scene1.3.72">And wheresoever we went, like Juno's swans,</div>
 <div id="scene1.3.73">Still we went coupled and inseparable.</div>
 </div>
 <div id="speech31" class="character">DUKE FREDERICK</div>
 <div class="dialog">
 <div id="scene1.3.74">She is too subtle for thee; and her smoothness,</div>
 <div id="scene1.3.75">Her very silence and her patience</div>
 <div id="scene1.3.76">Speak to the people, and they pity her.</div>
 <div id="scene1.3.77">Thou art a fool: she robs thee of thy name;</div>
 <div id="scene1.3.78">And thou wilt show more bright and seem more virtuous</div>
 <div id="scene1.3.79">When she is gone. Then open not thy lips:</div>
 <div id="scene1.3.80">Firm and irrevocable is my doom</div>
 <div id="scene1.3.81">Which I have pass'd upon her; she is banish'd.</div>
 </div>
 <div id="speech32" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.82">Pronounce that sentence then on me, my liege:</div>
 <div id="scene1.3.83">I cannot live out of her company.</div>
 </div>
 <div id="speech33" class="character">DUKE FREDERICK</div>
 <div class="dialog">
 <div id="scene1.3.84">You are a fool. You, niece, provide yourself:</div>
 <div id="scene1.3.85">If you outstay the time, upon mine honour,</div>
 <div id="scene1.3.86">And in the greatness of my word, you die.</div>
 <div class="direction">Exeunt DUKE FREDERICK and Lords</div>
 </div>
 <div id="speech34" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.87">O my poor Rosalind, whither wilt thou go?</div>
 <div id="scene1.3.88">Wilt thou change fathers? I will give thee mine.</div>
 <div id="scene1.3.89">I charge thee, be not thou more grieved than I am.</div>
 </div>
 <div id="speech35" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.90">I have more cause.</div>
 </div>
 <div id="speech36" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.91">                  Thou hast not, cousin;</div>
 <div id="scene1.3.92">Prithee be cheerful: know'st thou not, the duke</div>
 <div id="scene1.3.93">Hath banish'd me, his daughter?</div>
 </div>
 <div id="speech37" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.94">That he hath not.</div>
 </div>
 <div id="speech38" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.95">No, hath not? Rosalind lacks then the love</div>
 <div id="scene1.3.96">Which teacheth thee that thou and I am one:</div>
 <div id="scene1.3.97">Shall we be sunder'd? shall we part, sweet girl?</div>
 <div id="scene1.3.98">No: let my father seek another heir.</div>
 <div id="scene1.3.99">Therefore devise with me how we may fly,</div>
 <div id="scene1.3.100">Whither to go and what to bear with us;</div>
 <div id="scene1.3.101">And do not seek to take your change upon you,</div>
 <div id="scene1.3.102">To bear your griefs yourself and leave me out;</div>
 <div id="scene1.3.103">For, by this heaven, now at our sorrows pale,</div>
 <div id="scene1.3.104">Say what thou canst, I'll go along with thee.</div>
 </div>
 <div id="speech39" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.105">Why, whither shall we go?</div>
 </div>
 <div id="speech40" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.106">To seek my uncle in the forest of Arden.</div>
 </div>
 <div id="speech41" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.107">Alas, what danger will it be to us,</div>
 <div id="scene1.3.108">Maids as we are, to travel forth so far!</div>
 <div id="scene1.3.109">Beauty provoketh thieves sooner than gold.</div>
 </div>
 <div id="speech42" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.110">I'll put myself in poor and mean attire</div>
 <div id="scene1.3.111">And with a kind of umber smirch my face;</div>
 <div id="scene1.3.112">The like do you: so shall we pass along</div>
 <div id="scene1.3.113">And never stir assailants.</div>
 </div>
 <div id="speech43" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.114">Were it not better,</div>
 <div id="scene1.3.115">Because that I am more than common tall,</div>
 <div id="scene1.3.116">That I did suit me all points like a man?</div>
 <div id="scene1.3.117">A gallant curtle-axe upon my thigh,</div>
 <div id="scene1.3.118">A boar-spear in my hand; and--in my heart</div>
 <div id="scene1.3.119">Lie there what hidden woman's fear there will--</div>
 <div id="scene1.3.120">We'll have a swashing and a martial outside,</div>
 <div id="scene1.3.121">As many other mannish cowards have</div>
 <div id="scene1.3.122">That do outface it with their semblances.</div>
 </div>
 <div id="speech44" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.123">What shall I call thee when thou art a man?</div>
 </div>
 <div id="speech45" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.124">I'll have no worse a name than Jove's own page;</div>
 <div id="scene1.3.125">And therefore look you call me Ganymede.</div>
 <div id="scene1.3.126">But what will you be call'd?</div>
 </div>
 <div id="speech46" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.127">Something that hath a reference to my state</div>
 <div id="scene1.3.128">No longer Celia, but Aliena.</div>
 </div>
 <div id="speech47" class="character">ROSALIND</div>
 <div class="dialog">
 <div id="scene1.3.129">But, cousin, what if we assay'd to steal</div>
 <div id="scene1.3.130">The clownish fool out of your father's court?</div>
 <div id="scene1.3.131">Would he not be a comfort to our travel?</div>
 </div>
 <div id="speech48" class="character">CELIA</div>
 <div class="dialog">
 <div id="scene1.3.132">He'll go along o'er the wide world with me;</div>
 <div id="scene1.3.133">Leave me alone to woo him. Let's away,</div>
 <div id="scene1.3.134">And get our jewels and our wealth together,</div>
 <div id="scene1.3.135">Devise the fittest time and safest way</div>
 <div id="scene1.3.136">To hide us from pursuit that will be made</div>
 <div id="scene1.3.137">After my flight. Now go we in content</div>
 <div id="scene1.3.138">To liberty and not to banishment.</div>
 <div class="direction">Exeunt</div>
 </div>
 </div>
 </div>
 </div>
 </body>
 </html>
 '''
 # }}}
    ae = unittest.TestCase.assertEqual
    def test_tokenizer(self):  # {{{
        tokens = [
            type('')(item) for item in tokenize(
                r'E\ é > f [a~="y\"x"]:nth(/* fu /]* */-3.7)')]
        self.ae(tokens, [
            "<IDENT 'E é' at 0>",
            "<S ' ' at 4>",
            "<DELIM '>' at 5>",
            "<S ' ' at 6>",
            # the no-break space is not whitespace in CSS
            "<IDENT 'f ' at 7>",  # f\xa0
            "<DELIM '[' at 9>",
            "<IDENT 'a' at 10>",
            "<DELIM '~' at 11>",
            "<DELIM '=' at 12>",
            "<STRING 'y\"x' at 13>",
            "<DELIM ']' at 19>",
            "<DELIM ':' at 20>",
            "<IDENT 'nth' at 21>",
            "<DELIM '(' at 24>",
            "<NUMBER '-3.7' at 37>",
            "<DELIM ')' at 41>",
            "<EOF at 42>",
        ])
    # }}}
    def test_parser(self):  # {{{
        def repr_parse(css):
            selectors = parse(css)
            for selector in selectors:
                assert selector.pseudo_element is None
            return [repr(selector.parsed_tree).replace("(u'", "('")
                    for selector in selectors]
        def parse_many(first, *others):
            result = repr_parse(first)
            for other in others:
                assert repr_parse(other) == result
            return result
        assert parse_many('*') == ['Element[*]']
        assert parse_many('*|*') == ['Element[*]']
        assert parse_many('*|foo') == ['Element[foo]']
        assert parse_many('foo|*') == ['Element[foo|*]']
        assert parse_many('foo|bar') == ['Element[foo|bar]']
        # This will never match, but it is valid:
        assert parse_many('#foo#bar') == ['Hash[Hash[Element[*]#foo]#bar]']
        assert parse_many(
            'div>.foo',
            'div> .foo',
            'div >.foo',
            'div > .foo',
            'div \n>  \t \t .foo', 'div\r>\n\n\n.foo', 'div\f>\f.foo'
        ) == ['CombinedSelector[Element[div] > Class[Element[*].foo]]']
        assert parse_many('td.foo,.bar',
            'td.foo, .bar',
            'td.foo\t\r\n\f ,\t\r\n\f .bar'
        ) == [
            'Class[Element[td].foo]',
            'Class[Element[*].bar]'
        ]
        assert parse_many('div, td.foo, div.bar span') == [
            'Element[div]',
            'Class[Element[td].foo]',
            'CombinedSelector[Class[Element[div].bar] '
            '<followed> Element[span]]']
        assert parse_many('div > p') == [
            'CombinedSelector[Element[div] > Element[p]]']
        assert parse_many('td:first') == [
            'Pseudo[Element[td]:first]']
        assert parse_many('td :first') == [
            'CombinedSelector[Element[td] '
            '<followed> Pseudo[Element[*]:first]]']
        assert parse_many('a[name]', 'a[ name\t]') == [
            'Attrib[Element[a][name]]']
        assert parse_many('a [name]') == [
            'CombinedSelector[Element[a] <followed> Attrib[Element[*][name]]]']
        self.ae(parse_many('a[rel="include"]', 'a[rel = include]'), [
            "Attrib[Element[a][rel = 'include']]"])
        assert parse_many("a[hreflang |= 'en']", "a[hreflang|=en]") == [
            "Attrib[Element[a][hreflang |= 'en']]"]
        self.ae(parse_many('div:nth-child(10)'), [
            "Function[Element[div]:nth-child(['10'])]"])
        assert parse_many(':nth-child(2n+2)') == [
            "Function[Element[*]:nth-child(['2', 'n', '+2'])]"]
        assert parse_many('div:nth-of-type(10)') == [
            "Function[Element[div]:nth-of-type(['10'])]"]
        assert parse_many('div div:nth-of-type(10) .aclass') == [
            'CombinedSelector[CombinedSelector[Element[div] <followed> '
            "Function[Element[div]:nth-of-type(['10'])]] "
            '<followed> Class[Element[*].aclass]]']
        assert parse_many('label:only') == [
            'Pseudo[Element[label]:only]']
        assert parse_many('a:lang(fr)') == [
            "Function[Element[a]:lang(['fr'])]"]
        assert parse_many('div:contains("foo")') == [
            "Function[Element[div]:contains(['foo'])]"]
        assert parse_many('div#foobar') == [
            'Hash[Element[div]#foobar]']
        assert parse_many('div:not(div.foo)') == [
            'Negation[Element[div]:not(Class[Element[div].foo])]']
        assert parse_many('td ~ th') == [
            'CombinedSelector[Element[td] ~ Element[th]]']
    # }}}
    def test_pseudo_elements(self):  # {{{
        def parse_pseudo(css):
            result = []
            for selector in parse(css):
                pseudo = selector.pseudo_element
                pseudo = type('')(pseudo) if pseudo else pseudo
                # No Symbol here
                assert pseudo is None or isinstance(pseudo, type(''))
                selector = repr(selector.parsed_tree).replace("(u'", "('")
                result.append((selector, pseudo))
            return result
        def parse_one(css):
            result = parse_pseudo(css)
            assert len(result) == 1
            return result[0]
        self.ae(parse_one('foo'), ('Element[foo]', None))
        self.ae(parse_one('*'), ('Element[*]', None))
        self.ae(parse_one(':empty'), ('Pseudo[Element[*]:empty]', None))
        # Special cases for CSS 2.1 pseudo-elements
        self.ae(parse_one(':BEfore'), ('Element[*]', 'before'))
        self.ae(parse_one(':aftER'), ('Element[*]', 'after'))
        self.ae(parse_one(':First-Line'), ('Element[*]', 'first-line'))
        self.ae(parse_one(':First-Letter'), ('Element[*]', 'first-letter'))
        self.ae(parse_one('::befoRE'), ('Element[*]', 'before'))
        self.ae(parse_one('::AFter'), ('Element[*]', 'after'))
        self.ae(parse_one('::firsT-linE'), ('Element[*]', 'first-line'))
        self.ae(parse_one('::firsT-letteR'), ('Element[*]', 'first-letter'))
        self.ae(parse_one('::text-content'), ('Element[*]', 'text-content'))
        self.ae(parse_one('::attr(name)'), (
            "Element[*]", "FunctionalPseudoElement[::attr(['name'])]"))
        self.ae(parse_one('::Selection'), ('Element[*]', 'selection'))
        self.ae(parse_one('foo:after'), ('Element[foo]', 'after'))
        self.ae(parse_one('foo::selection'), ('Element[foo]', 'selection'))
        self.ae(parse_one('lorem#ipsum ~ a#b.c[href]:empty::selection'), (
            'CombinedSelector[Hash[Element[lorem]#ipsum] ~ '
            'Pseudo[Attrib[Class[Hash[Element[a]#b].c][href]]:empty]]',
            'selection'))
        self.ae(parse_pseudo('foo:before, bar, baz:after'), [
            ('Element[foo]', 'before'),
            ('Element[bar]', None),
            ('Element[baz]', 'after')])
    # }}}
    def test_specificity(self):  # {{{
        def specificity(css):
            selectors = parse(css)
            assert len(selectors) == 1
            return selectors[0].specificity()
        assert specificity('*') == (0, 0, 0)
        assert specificity(' foo') == (0, 0, 1)
        assert specificity(':empty ') == (0, 1, 0)
        assert specificity(':before') == (0, 0, 1)
        assert specificity('*:before') == (0, 0, 1)
        assert specificity(':nth-child(2)') == (0, 1, 0)
        assert specificity('.bar') == (0, 1, 0)
        assert specificity('[baz]') == (0, 1, 0)
        assert specificity('[baz="4"]') == (0, 1, 0)
        assert specificity('[baz^="4"]') == (0, 1, 0)
        assert specificity('#lipsum') == (1, 0, 0)
        assert specificity(':not(*)') == (0, 0, 0)
        assert specificity(':not(foo)') == (0, 0, 1)
        assert specificity(':not(.foo)') == (0, 1, 0)
        assert specificity(':not([foo])') == (0, 1, 0)
        assert specificity(':not(:empty)') == (0, 1, 0)
        assert specificity(':not(#foo)') == (1, 0, 0)
        assert specificity('foo:empty') == (0, 1, 1)
        assert specificity('foo:before') == (0, 0, 2)
        assert specificity('foo::before') == (0, 0, 2)
        assert specificity('foo:empty::before') == (0, 1, 2)
        assert specificity('#lorem + foo#ipsum:first-child > bar:first-line'
            ) == (2, 1, 3)
    # }}}
    def test_parse_errors(self):  # {{{
        def get_error(css):
            try:
                parse(css)
            except SelectorSyntaxError:
                # Py2, Py3, ...
                return str(sys.exc_info()[1]).replace("(u'", "('")
        self.ae(get_error('attributes(href)/html/body/a'), (
            "Expected selector, got <DELIM '(' at 10>"))
        assert get_error('attributes(href)') == (
            "Expected selector, got <DELIM '(' at 10>")
        assert get_error('html/body/a') == (
            "Expected selector, got <DELIM '/' at 4>")
        assert get_error(' ') == (
            "Expected selector, got <EOF at 1>")
        assert get_error('div, ') == (
            "Expected selector, got <EOF at 5>")
        assert get_error(' , div') == (
            "Expected selector, got <DELIM ',' at 1>")
        assert get_error('p, , div') == (
            "Expected selector, got <DELIM ',' at 3>")
        assert get_error('div > ') == (
            "Expected selector, got <EOF at 6>")
        assert get_error('  > div') == (
            "Expected selector, got <DELIM '>' at 2>")
        assert get_error('foo|#bar') == (
            "Expected ident or '*', got <HASH 'bar' at 4>")
        assert get_error('#.foo') == (
            "Expected selector, got <DELIM '#' at 0>")
        assert get_error('.#foo') == (
            "Expected ident, got <HASH 'foo' at 1>")
        assert get_error(':#foo') == (
            "Expected ident, got <HASH 'foo' at 1>")
        assert get_error('[*]') == (
            "Expected '|', got <DELIM ']' at 2>")
        assert get_error('[foo|]') == (
            "Expected ident, got <DELIM ']' at 5>")
        assert get_error('[#]') == (
            "Expected ident or '*', got <DELIM '#' at 1>")
        assert get_error('[foo=#]') == (
            "Expected string or ident, got <DELIM '#' at 5>")
        assert get_error('[href]a') == (
            "Expected selector, got <IDENT 'a' at 6>")
        assert get_error('[rel=stylesheet]') is None
        assert get_error('[rel:stylesheet]') == (
            "Operator expected, got <DELIM ':' at 4>")
        assert get_error('[rel=stylesheet') == (
            "Expected ']', got <EOF at 15>")
        assert get_error(':lang(fr)') is None
        assert get_error(':lang(fr') == (
            "Expected an argument, got <EOF at 8>")
        assert get_error(':contains("foo') == (
            "Unclosed string at 10")
        assert get_error('foo!') == (
            "Expected selector, got <DELIM '!' at 3>")
        # Mis-placed pseudo-elements
        assert get_error('a:before:empty') == (
            "Got pseudo-element ::before not at the end of a selector")
        assert get_error('li:before a') == (
            "Got pseudo-element ::before not at the end of a selector")
        assert get_error(':not(:before)') == (
            "Got pseudo-element ::before inside :not() at 12")
        assert get_error(':not(:not(a))') == (
            "Got nested :not()")
    # }}}
    def test_select(self):  # {{{
        document = etree.fromstring(self.HTML_IDS, parser=etree.XMLParser(recover=True, no_network=True, resolve_entities=False))
        select = Select(document)
        def select_ids(selector):
            for elem in select(selector):
                yield elem.get('id')
        def pcss(main, *selectors, **kwargs):
            result = list(select_ids(main))
            for selector in selectors:
                self.ae(list(select_ids(selector)), result)
            return result
        all_ids = pcss('*')
        self.ae(all_ids[:6], [
            'html', None, 'link-href', 'link-nohref', None, 'outer-div'])
        self.ae(all_ids[-1:], ['foobar-span'])
        self.ae(pcss('div'), ['outer-div', 'li-div', 'foobar-div'])
        self.ae(pcss('DIV'), [
            'outer-div', 'li-div', 'foobar-div'])  # case-insensitive in HTML
        self.ae(pcss('div div'), ['li-div'])
        self.ae(pcss('div, div div'), ['outer-div', 'li-div', 'foobar-div'])
        self.ae(pcss('a[name]'), ['name-anchor'])
        self.ae(pcss('a[NAme]'), ['name-anchor'])  # case-insensitive in HTML
        self.ae(pcss('a[rel]'), ['tag-anchor', 'nofollow-anchor'])
        self.ae(pcss('a[rel="tag"]'), ['tag-anchor'])
        self.ae(pcss('a[href*="localhost"]'), ['tag-anchor'])
        self.ae(pcss('a[href*=""]'), [])
        self.ae(pcss('a[href^="http"]'), ['tag-anchor', 'nofollow-anchor'])
        self.ae(pcss('a[href^="http:"]'), ['tag-anchor'])
        self.ae(pcss('a[href^=""]'), [])
        self.ae(pcss('a[href$="org"]'), ['nofollow-anchor'])
        self.ae(pcss('a[href$=""]'), [])
        self.ae(pcss('div[foobar~="bc"]', 'div[foobar~="cde"]', skip_webkit=True), ['foobar-div'])
        self.ae(pcss('[foobar~="ab bc"]', '[foobar~=""]', '[foobar~=" \t"]'), [])
        self.ae(pcss('div[foobar~="cd"]'), [])
        self.ae(pcss('*[lang|="En"]', '[lang|="En-us"]'), ['second-li'])
        # Attribute values are case sensitive
        self.ae(pcss('*[lang|="en"]', '[lang|="en-US"]', skip_webkit=True), [])
        self.ae(pcss('*[lang|="e"]'), [])
        self.ae(pcss(':lang("EN")', '*:lang(en-US)', skip_webkit=True), ['second-li', 'li-div'])
        self.ae(pcss(':lang("e")'), [])
        self.ae(pcss('li:nth-child(1)', 'li:first-child'), ['first-li'])
        self.ae(pcss('li:nth-child(3)', '#first-li ~ :nth-child(3)'), ['third-li'])
        self.ae(pcss('li:nth-child(10)'), [])
        self.ae(pcss('li:nth-child(2n)', 'li:nth-child(even)', 'li:nth-child(2n+0)'), ['second-li', 'fourth-li', 'sixth-li'])
        self.ae(pcss('li:nth-child(+2n+1)', 'li:nth-child(odd)'), ['first-li', 'third-li', 'fifth-li', 'seventh-li'])
        self.ae(pcss('li:nth-child(2n+4)'), ['fourth-li', 'sixth-li'])
        self.ae(pcss('li:nth-child(3n+1)'), ['first-li', 'fourth-li', 'seventh-li'])
        self.ae(pcss('li:nth-last-child(0)'), [])
        self.ae(pcss('li:nth-last-child(1)', 'li:last-child'), ['seventh-li'])
        self.ae(pcss('li:nth-last-child(2n)', 'li:nth-last-child(even)'), ['second-li', 'fourth-li', 'sixth-li'])
        self.ae(pcss('li:nth-last-child(2n+2)'), ['second-li', 'fourth-li', 'sixth-li'])
        self.ae(pcss('ol:first-of-type'), ['first-ol'])
        self.ae(pcss('ol:nth-child(1)'), [])
        self.ae(pcss('ol:nth-of-type(2)'), ['second-ol'])
        self.ae(pcss('ol:nth-last-of-type(1)'), ['second-ol'])
        self.ae(pcss('span:only-child'), ['foobar-span'])
        self.ae(pcss('li div:only-child'), ['li-div'])
        self.ae(pcss('div *:only-child'), ['li-div', 'foobar-span'])
        self.ae(pcss('p *:only-of-type', skip_webkit=True), ['p-em', 'fieldset'])
        self.ae(pcss('p:only-of-type', skip_webkit=True), ['paragraph'])
        self.ae(pcss('a:empty', 'a:EMpty'), ['name-anchor'])
        self.ae(pcss('li:empty'), ['third-li', 'fourth-li', 'fifth-li', 'sixth-li'])
        self.ae(pcss(':root', 'html:root', 'li:root'), ['html'])
        self.ae(pcss('* :root', 'p *:root'), [])
        self.ae(pcss('.a', '.b', '*.a', 'ol.a'), ['first-ol'])
        self.ae(pcss('.c', '*.c'), ['first-ol', 'third-li', 'fourth-li'])
        self.ae(pcss('ol *.c', 'ol li.c', 'li ~ li.c', 'ol > li.c'), [
            'third-li', 'fourth-li'])
        self.ae(pcss('#first-li', 'li#first-li', '*#first-li'), ['first-li'])
        self.ae(pcss('li div', 'li > div', 'div div'), ['li-div'])
        self.ae(pcss('div > div'), [])
        self.ae(pcss('div>.c', 'div > .c'), ['first-ol'])
        self.ae(pcss('div + div'), ['foobar-div'])
        self.ae(pcss('a ~ a'), ['tag-anchor', 'nofollow-anchor'])
        self.ae(pcss('a[rel="tag"] ~ a'), ['nofollow-anchor'])
        self.ae(pcss('ol#first-ol li:last-child'), ['seventh-li'])
        self.ae(pcss('ol#first-ol *:last-child'), ['li-div', 'seventh-li'])
        self.ae(pcss('#outer-div:first-child'), ['outer-div'])
        self.ae(pcss('#outer-div :first-child'), [
            'name-anchor', 'first-li', 'li-div', 'p-b',
            'checkbox-fieldset-disabled', 'area-href'])
        self.ae(pcss('a[href]'), ['tag-anchor', 'nofollow-anchor'])
        self.ae(pcss(':not(*)'), [])
        self.ae(pcss('a:not([href])'), ['name-anchor'])
        self.ae(pcss('ol :Not(li[class])', skip_webkit=True), [
            'first-li', 'second-li', 'li-div',
            'fifth-li', 'sixth-li', 'seventh-li'])
        self.ae(pcss(r'di\a0 v', r'div\['), [])
        self.ae(pcss(r'[h\a0 ref]', r'[h\]ref]'), [])
        self.assertRaises(ExpressionError, lambda : tuple(select('body:nth-child')))
        select = Select(document, ignore_inappropriate_pseudo_classes=True)
        self.assertGreater(len(tuple(select('p:hover'))), 0)
    def test_select_shakespeare(self):
        document = html.document_fromstring(self.HTML_SHAKESPEARE)
        select = Select(document)
        count = lambda s: sum(1 for r in select(s))
        # Data borrowed from http://mootools.net/slickspeed/
        # Changed from original; probably because I'm only
        # searching the body.
        self.ae(count('*'), 249)
        assert count('div:only-child') == 22  # ?
        assert count('div:nth-child(even)') == 106
        assert count('div:nth-child(2n)') == 106
        assert count('div:nth-child(odd)') == 137
        assert count('div:nth-child(2n+1)') == 137
        assert count('div:nth-child(n)') == 243
        assert count('div:last-child') == 53
        assert count('div:first-child') == 51
        assert count('div > div') == 242
        assert count('div + div') == 190
        assert count('div ~ div') == 190
        assert count('body') == 1
        assert count('body div') == 243
        assert count('div') == 243
        assert count('div div') == 242
        assert count('div div div') == 241
        assert count('div, div, div') == 243
        assert count('div, a, span') == 243
        assert count('.dialog') == 51
        assert count('div.dialog') == 51
        assert count('div .dialog') == 51
        assert count('div.character, div.dialog') == 99
        assert count('div.direction.dialog') == 0
        assert count('div.dialog.direction') == 0
        assert count('div.dialog.scene') == 1
        assert count('div.scene.scene') == 1
        assert count('div.scene .scene') == 0
        assert count('div.direction .dialog ') == 0
        assert count('div .dialog .direction') == 4
        assert count('div.dialog .dialog .direction') == 4
        assert count('#speech5') == 1
        assert count('div#speech5') == 1
        assert count('div #speech5') == 1
        assert count('div.scene div.dialog') == 49
        assert count('div#scene1 div.dialog div') == 142
        assert count('#scene1 #speech1') == 1
        assert count('div[class]') == 103
        assert count('div[class=dialog]') == 50
        assert count('div[class^=dia]') == 51
        assert count('div[class$=log]') == 50
        assert count('div[class*=sce]') == 1
        assert count('div[class|=dialog]') == 50  # ? Seems right
        assert count('div[class~=dialog]') == 51  # ? Seems right
    # }}}
 # Run tests {{{
 def find_tests():
    return unittest.defaultTestLoader.loadTestsFromTestCase(TestCSSSelectors)
 def run_tests(find_tests=find_tests, for_build=False):
    if not for_build:
        parser = argparse.ArgumentParser()
        parser.add_argument('name', nargs='?', default=None,
                            help='The name of the test to run')
        args = parser.parse_args()
    if not for_build and args.name and args.name.startswith('.'):
        tests = find_tests()
        q = args.name[1:]
        if not q.startswith('test_'):
            q = 'test_' + q
        ans = None
        for test in tests:
            if test._testMethodName == q:
                ans = test
                break
        if ans is None:
            print('No test named %s found' % args.name)
            raise SystemExit(1)
        tests = ans
    else:
        tests = unittest.defaultTestLoader.loadTestsFromName(args.name) if not for_build and args.name else find_tests()
    r = unittest.TextTestRunner
    if for_build:
        r = r(verbosity=0, buffer=True, failfast=True)
    else:
        r = r(verbosity=4)
    result = r.run(tests)
    if for_build and (result.errors or result.failures):
        raise SystemExit(1)
 if __name__ == '__main__':
    run_tests()
 # }}}
--- a/ebook_converter/customize/__init__.py
+++ b/ebook_converter/customize/__init__.py
@@ -0,0 +1,759 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 import os, sys, zipfile, importlib
 from calibre.constants import numeric_version, iswindows, isosx
 from calibre.ptempfile import PersistentTemporaryFile
 from polyglot.builtins import unicode_type
 platform = 'linux'
 if iswindows:
    platform = 'windows'
 elif isosx:
    platform = 'osx'
 class PluginNotFound(ValueError):
    pass
 class InvalidPlugin(ValueError):
    pass
 class Plugin(object):  # {{{
    '''
    A calibre plugin. Useful members include:
       * ``self.plugin_path``: Stores path to the ZIP file that contains
                               this plugin or None if it is a builtin
                               plugin
       * ``self.site_customization``: Stores a customization string entered
                                      by the user.
    Methods that should be overridden in sub classes:
       * :meth:`initialize`
       * :meth:`customization_help`
    Useful methods:
        * :meth:`temporary_file`
        * :meth:`__enter__`
        * :meth:`load_resources`
    '''
    #: List of platforms this plugin works on.
    #: For example: ``['windows', 'osx', 'linux']``
    supported_platforms = []
    #: The name of this plugin. You must set it to something other
    #: than Trivial Plugin for it to work.
    name           = 'Trivial Plugin'
    #: The version of this plugin as a 3-tuple (major, minor, revision)
    version        = (1, 0, 0)
    #: A short string describing what this plugin does
    description    = _('Does absolutely nothing')
    #: The author of this plugin
    author         = _('Unknown')
    #: When more than one plugin exists for a filetype,
    #: the plugins are run in order of decreasing priority.
    #: Plugins with higher priority will be run first.
    #: The highest possible priority is ``sys.maxsize``.
    #: Default priority is 1.
    priority = 1
    #: The earliest version of calibre this plugin requires
    minimum_calibre_version = (0, 4, 118)
    #: If False, the user will not be able to disable this plugin. Use with
    #: care.
    can_be_disabled = True
    #: The type of this plugin. Used for categorizing plugins in the
    #: GUI
    type = _('Base')
    def __init__(self, plugin_path):
        self.plugin_path        = plugin_path
        self.site_customization = None
    def initialize(self):
        '''
        Called once when calibre plugins are initialized.  Plugins are
        re-initialized every time a new plugin is added. Also note that if the
        plugin is run in a worker process, such as for adding books, then the
        plugin will be initialized for every new worker process.
        Perform any plugin specific initialization here, such as extracting
        resources from the plugin ZIP file. The path to the ZIP file is
        available as ``self.plugin_path``.
        Note that ``self.site_customization`` is **not** available at this point.
        '''
        pass
    def config_widget(self):
        '''
        Implement this method and :meth:`save_settings` in your plugin to
        use a custom configuration dialog, rather than relying on the simple
        string based default customization.
        This method, if implemented, must return a QWidget. The widget can have
        an optional method validate() that takes no arguments and is called
        immediately after the user clicks OK. Changes are applied if and only
        if the method returns True.
        If for some reason you cannot perform the configuration at this time,
        return a tuple of two strings (message, details), these will be
        displayed as a warning dialog to the user and the process will be
        aborted.
        '''
        raise NotImplementedError()
    def save_settings(self, config_widget):
        '''
        Save the settings specified by the user with config_widget.
        :param config_widget: The widget returned by :meth:`config_widget`.
        '''
        raise NotImplementedError()
    def do_user_config(self, parent=None):
        '''
        This method shows a configuration dialog for this plugin. It returns
        True if the user clicks OK, False otherwise. The changes are
        automatically applied.
        '''
        from PyQt5.Qt import QDialog, QDialogButtonBox, QVBoxLayout, \
                QLabel, Qt, QLineEdit
        from calibre.gui2 import gprefs
        prefname = 'plugin config dialog:'+self.type + ':' + self.name
        geom = gprefs.get(prefname, None)
        config_dialog = QDialog(parent)
        button_box = QDialogButtonBox(QDialogButtonBox.Ok | QDialogButtonBox.Cancel)
        v = QVBoxLayout(config_dialog)
        def size_dialog():
            if geom is None:
                config_dialog.resize(config_dialog.sizeHint())
            else:
                from PyQt5.Qt import QApplication
                QApplication.instance().safe_restore_geometry(config_dialog, geom)
        button_box.accepted.connect(config_dialog.accept)
        button_box.rejected.connect(config_dialog.reject)
        config_dialog.setWindowTitle(_('Customize') + ' ' + self.name)
        try:
            config_widget = self.config_widget()
        except NotImplementedError:
            config_widget = None
        if isinstance(config_widget, tuple):
            from calibre.gui2 import warning_dialog
            warning_dialog(parent, _('Cannot configure'), config_widget[0],
                    det_msg=config_widget[1], show=True)
            return False
        if config_widget is not None:
            v.addWidget(config_widget)
            v.addWidget(button_box)
            size_dialog()
            config_dialog.exec_()
            if config_dialog.result() == QDialog.Accepted:
                if hasattr(config_widget, 'validate'):
                    if config_widget.validate():
                        self.save_settings(config_widget)
                else:
                    self.save_settings(config_widget)
        else:
            from calibre.customize.ui import plugin_customization, \
                customize_plugin
            help_text = self.customization_help(gui=True)
            help_text = QLabel(help_text, config_dialog)
            help_text.setWordWrap(True)
            help_text.setTextInteractionFlags(Qt.LinksAccessibleByMouse | Qt.LinksAccessibleByKeyboard)
            help_text.setOpenExternalLinks(True)
            v.addWidget(help_text)
            sc = plugin_customization(self)
            if not sc:
                sc = ''
            sc = sc.strip()
            sc = QLineEdit(sc, config_dialog)
            v.addWidget(sc)
            v.addWidget(button_box)
            size_dialog()
            config_dialog.exec_()
            if config_dialog.result() == QDialog.Accepted:
                sc = unicode_type(sc.text()).strip()
                customize_plugin(self, sc)
        geom = bytearray(config_dialog.saveGeometry())
        gprefs[prefname] = geom
        return config_dialog.result()
    def load_resources(self, names):
        '''
        If this plugin comes in a ZIP file (user added plugin), this method
        will allow you to load resources from the ZIP file.
        For example to load an image::
            pixmap = QPixmap()
            pixmap.loadFromData(self.load_resources(['images/icon.png'])['images/icon.png'])
            icon = QIcon(pixmap)
        :param names: List of paths to resources in the ZIP file using / as separator
        :return: A dictionary of the form ``{name: file_contents}``. Any names
                 that were not found in the ZIP file will not be present in the
                 dictionary.
        '''
        if self.plugin_path is None:
            raise ValueError('This plugin was not loaded from a ZIP file')
        ans = {}
        with zipfile.ZipFile(self.plugin_path, 'r') as zf:
            for candidate in zf.namelist():
                if candidate in names:
                    ans[candidate] = zf.read(candidate)
        return ans
    def customization_help(self, gui=False):
        '''
        Return a string giving help on how to customize this plugin.
        By default raise a :class:`NotImplementedError`, which indicates that
        the plugin does not require customization.
        If you re-implement this method in your subclass, the user will
        be asked to enter a string as customization for this plugin.
        The customization string will be available as
        ``self.site_customization``.
        Site customization could be anything, for example, the path to
        a needed binary on the user's computer.
        :param gui: If True return HTML help, otherwise return plain text help.
        '''
        raise NotImplementedError()
    def temporary_file(self, suffix):
        '''
        Return a file-like object that is a temporary file on the file system.
        This file will remain available even after being closed and will only
        be removed on interpreter shutdown. Use the ``name`` member of the
        returned object to access the full path to the created temporary file.
        :param suffix: The suffix that the temporary file will have.
        '''
        return PersistentTemporaryFile(suffix)
    def is_customizable(self):
        try:
            self.customization_help()
            return True
        except NotImplementedError:
            return False
    def __enter__(self, *args):
        '''
        Add this plugin to the Python path so that its contents become directly importable.
        Useful when bundling large python libraries into the plugin. Use it like this::
            with plugin:
                import something
        '''
        if self.plugin_path is not None:
            from calibre.utils.zipfile import ZipFile
            zf = ZipFile(self.plugin_path)
            extensions = {x.rpartition('.')[-1].lower() for x in
                zf.namelist()}
            zip_safe = True
            for ext in ('pyd', 'so', 'dll', 'dylib'):
                if ext in extensions:
                    zip_safe = False
                    break
            if zip_safe:
                sys.path.insert(0, self.plugin_path)
                self.sys_insertion_path = self.plugin_path
            else:
                from calibre.ptempfile import TemporaryDirectory
                self._sys_insertion_tdir = TemporaryDirectory('plugin_unzip')
                self.sys_insertion_path = self._sys_insertion_tdir.__enter__(*args)
                zf.extractall(self.sys_insertion_path)
                sys.path.insert(0, self.sys_insertion_path)
            zf.close()
    def __exit__(self, *args):
        ip, it = getattr(self, 'sys_insertion_path', None), getattr(self,
                '_sys_insertion_tdir', None)
        if ip in sys.path:
            sys.path.remove(ip)
        if hasattr(it, '__exit__'):
            it.__exit__(*args)
    def cli_main(self, args):
        '''
        This method is the main entry point for your plugin's command line
        interface. It is called when the user does: calibre-debug -r "Plugin
        Name". Any arguments passed are present in the args variable.
        '''
        raise NotImplementedError('The %s plugin has no command line interface'
                                  %self.name)
 # }}}
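The `__enter__`/`__exit__` pair above decides whether the plugin ZIP can go on `sys.path` directly, then removes the entry again on exit. The following is a minimal standalone sketch of those two ideas using only the standard library. The names `NATIVE_EXTENSIONS`, `is_zip_safe` and `on_sys_path` are hypothetical and are not part of calibre, and the sketch leaves out the extraction to a temporary directory that the real method performs for archives that are not zip safe.

```python
import sys
from contextlib import contextmanager

# Extensions of compiled libraries, which cannot be imported from inside a ZIP.
NATIVE_EXTENSIONS = frozenset({'pyd', 'so', 'dll', 'dylib'})


def is_zip_safe(names):
    """Return True if no archive member name ends in a native-library extension."""
    extensions = {name.rpartition('.')[-1].lower() for name in names}
    return not (extensions & NATIVE_EXTENSIONS)


@contextmanager
def on_sys_path(path):
    """Temporarily place ``path`` at the front of sys.path."""
    sys.path.insert(0, path)
    try:
        yield path
    finally:
        # Remove the first matching entry, as Plugin.__exit__ does.
        if path in sys.path:
            sys.path.remove(path)
```

With these helpers, `with on_sys_path(zip_path): import something` would mirror the `with plugin:` usage shown in the docstring for the zip-safe case.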
 class FileTypePlugin(Plugin):  # {{{
    '''
    A plugin that is associated with a particular set of file types.
    '''
    #: Set of file types for which this plugin should be run.
    #: Use '*' for all file types.
    #: For example: ``{'lit', 'mobi', 'prc'}``
    file_types     = set()
    #: If True, this plugin is run when books are added
    #: to the database
    on_import      = False
    #: If True, this plugin is run after books are added
    #: to the database. In this case the postimport and postadd
    #: methods of the plugin are called.
    on_postimport  = False
    #: If True, this plugin is run just before a conversion
    on_preprocess  = False
    #: If True, this plugin is run after conversion
    #: on the final file produced by the conversion output plugin.
    on_postprocess = False
    type = _('File type')
    def run(self, path_to_ebook):
        '''
        Run the plugin. Must be implemented in subclasses.
        It should perform whatever modifications are required
        on the e-book and return the absolute path to the
        modified e-book. If no modifications are needed, it should
        return the path to the original e-book. If an error is encountered
        it should raise an Exception. The default implementation
        simply returns the path to the original e-book. Note that the path to
        the original file (before any file type plugins are run) is available as
        ``self.original_path_to_file``.
        The modified e-book file should be created with the
        :meth:`temporary_file` method.
        :param path_to_ebook: Absolute path to the e-book.
        :return: Absolute path to the modified e-book.
        '''
        # Default implementation does nothing
        return path_to_ebook
    def postimport(self, book_id, book_format, db):
        '''
        Called post import, i.e., after the book file has been added to the database. Note that
        this is different from :meth:`postadd` which is called when the book record is created for
        the first time. This method is called whenever a new file is added to a book record. It is
        useful for modifying the book record based on the contents of the newly added file.
        :param book_id: Database id of the added book.
        :param book_format: The file type of the book that was added.
        :param db: Library database.
        '''
        pass  # Default implementation does nothing
    def postadd(self, book_id, fmt_map, db):
        '''
        Called post add, i.e. after a book has been added to the db. Note that
        this is different from :meth:`postimport`, which is called after a single book file
        has been added to a book. postadd() is called only when an entire book record
        with possibly more than one book file has been created for the first time.
        This is useful if you wish to modify the book record in the database when the
        book is first added to calibre.
        :param book_id: Database id of the added book.
        :param fmt_map: Map of file format to path from which the file format
            was added. Note that this might or might not point to an actual
            existing file, as sometimes files are added as streams, in which case
            it might be a dummy value or a non-existent path.
        :param db: Library database
        '''
        pass  # Default implementation does nothing
 # }}}
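The `priority` attribute and the `run()` contract together suggest a simple pipeline: plugins for a file type run from highest to lowest priority, and each one returns a path. The sketch below illustrates that under an assumption this file does not confirm, namely that the path returned by one plugin is handed to the next. `StubPlugin` and `run_in_priority_order` are hypothetical names used only for illustration.

```python
class StubPlugin:
    """Hypothetical stand-in for a FileTypePlugin subclass."""

    def __init__(self, name, priority, suffix=None):
        self.name = name
        self.priority = priority
        self.suffix = suffix

    def run(self, path_to_ebook):
        # Return the original path when no modification is needed.
        if self.suffix is None:
            return path_to_ebook
        return path_to_ebook + self.suffix


def run_in_priority_order(plugins, path_to_ebook):
    """Run plugins from highest to lowest priority, chaining their paths."""
    for plugin in sorted(plugins, key=lambda p: p.priority, reverse=True):
        path_to_ebook = plugin.run(path_to_ebook)
    return path_to_ebook
```

A real plugin would write its output to a file obtained from `temporary_file()` rather than appending a suffix to the path.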
 class MetadataReaderPlugin(Plugin):  # {{{
    '''
    A plugin that implements reading metadata from a set of file types.
    '''
    #: Set of file types for which this plugin should be run.
    #: For example: ``set(['lit', 'mobi', 'prc'])``
    file_types     = set()
    supported_platforms = ['windows', 'osx', 'linux']
    version = numeric_version
    author  = 'Kovid Goyal'
    type = _('Metadata reader')
    def __init__(self, *args, **kwargs):
        Plugin.__init__(self, *args, **kwargs)
        self.quick = False
    def get_metadata(self, stream, type):
        '''
        Return metadata for the file represented by stream (a file like object
        that supports reading). Raise an exception when there is an error
        with the input data.
        :param type: The type of file. Guaranteed to be one of the entries
            in :attr:`file_types`.
        :return: A :class:`calibre.ebooks.metadata.book.Metadata` object
        '''
        return None
 # }}}
 class MetadataWriterPlugin(Plugin):  # {{{
    '''
    A plugin that implements writing metadata to a set of file types.
    '''
    #: Set of file types for which this plugin should be run.
    #: For example: ``set(['lit', 'mobi', 'prc'])``
    file_types     = set()
    supported_platforms = ['windows', 'osx', 'linux']
    version = numeric_version
    author  = 'Kovid Goyal'
    type = _('Metadata writer')
    def __init__(self, *args, **kwargs):
        Plugin.__init__(self, *args, **kwargs)
        self.apply_null = False
    def set_metadata(self, stream, mi, type):
        '''
        Set metadata for the file represented by stream (a file like object
        that supports reading). Raise an exception when there is an error
        with the input data.
        :param type: The type of file. Guaranteed to be one of the entries
            in :attr:`file_types`.
        :param mi: A :class:`calibre.ebooks.metadata.book.Metadata` object
        '''
        pass
 # }}}
 class CatalogPlugin(Plugin):  # {{{
    '''
    A plugin that implements a catalog generator.
    '''
    resources_path = None
    #: Output file type for which this plugin should be run.
    #: For example: 'epub' or 'xml'
    file_types = set()
    type = _('Catalog generator')
    #: CLI parser options specific to this plugin, declared as namedtuple Option:
    #:
    #:   from collections import namedtuple
    #:   Option = namedtuple('Option', 'option, default, dest, help')
    #:   cli_options = [Option('--catalog-title', default = 'My Catalog',
    #:   dest = 'catalog_title', help = (_('Title of generated catalog. \nDefault:') + " '" + '%default' + "'"))]
    #:   cli_options parsed in calibre.db.cli.cmd_catalog:option_parser()
    #:
    cli_options = []
    def _field_sorter(self, key):
        '''
        Custom fields sort after standard fields
        '''
        if key.startswith('#'):
            return '~%s' % key[1:]
        else:
            return key
    def search_sort_db(self, db, opts):
        db.search(opts.search_text)
        if opts.sort_by:
            # 2nd arg = ascending
            db.sort(opts.sort_by, True)
        return db.get_data_as_dict(ids=opts.ids)
    def get_output_fields(self, db, opts):
        # Return a list of requested fields
        all_std_fields = {'author_sort','authors','comments','cover','formats',
                           'id','isbn','library_name','ondevice','pubdate','publisher',
                           'rating','series_index','series','size','tags','timestamp',
                           'title_sort','title','uuid','languages','identifiers'}
        all_custom_fields = set(db.custom_field_keys())
        for field in list(all_custom_fields):
            fm = db.field_metadata[field]
            if fm['datatype'] == 'series':
                all_custom_fields.add(field+'_index')
        all_fields = all_std_fields.union(all_custom_fields)
        if opts.fields != 'all':
            # Make a list from opts.fields
            of = [x.strip() for x in opts.fields.split(',')]
            requested_fields = set(of)
            # Validate requested_fields
            if requested_fields - all_fields:
                from calibre.library import current_library_name
                invalid_fields = sorted(list(requested_fields - all_fields))
                print("invalid --fields specified: %s" % ', '.join(invalid_fields))
                print("available fields in '%s': %s" %
                      (current_library_name(), ', '.join(sorted(list(all_fields)))))
                raise ValueError("unable to generate catalog with specified fields")
            fields = [x for x in of if x in all_fields]
        else:
            fields = sorted(all_fields, key=self._field_sorter)
        if not opts.connected_device['is_device_connected'] and 'ondevice' in fields:
            fields.pop(int(fields.index('ondevice')))
        return fields
    def initialize(self):
        '''
        If plugin is not a built-in, copy the plugin's .ui and .py files from
        the ZIP file to $TMPDIR.
        Tab will be dynamically generated and added to the Catalog Options dialog in
        calibre.gui2.dialogs.catalog.py:Catalog
        '''
        from calibre.customize.builtins import plugins as builtin_plugins
        from calibre.customize.ui import config
        from calibre.ptempfile import PersistentTemporaryDirectory
        if type(self) not in builtin_plugins and self.name not in config['disabled_plugins']:
            files_to_copy = ["%s.%s" % (self.name.lower(),ext) for ext in ["ui","py"]]
            resources = zipfile.ZipFile(self.plugin_path,'r')
            if self.resources_path is None:
                self.resources_path = PersistentTemporaryDirectory('_plugin_resources', prefix='')
            for file in files_to_copy:
                try:
                    resources.extract(file, self.resources_path)
                except Exception:
                    print(" customize:__init__.initialize(): %s not found in %s" % (file, os.path.basename(self.plugin_path)))
                    continue
            resources.close()
    def run(self, path_to_output, opts, db, ids, notification=None):
        '''
        Run the plugin. Must be implemented in subclasses.
        It should generate the catalog in the format specified
        in file_types, returning the absolute path to the
        generated catalog file. If an error is encountered
        it should raise an Exception.
        The generated catalog file should be created with the
        :meth:`temporary_file` method.
        :param path_to_output: Absolute path to the generated catalog file.
        :param opts: A dictionary of keyword arguments
        :param db: A LibraryDatabase2 object
        '''
        # Default implementation raises; subclasses must override run()
        raise NotImplementedError('CatalogPlugin.run() default '
                'method, should be overridden in subclass')
 # }}}
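The sort key used by `_field_sorter` relies on `~` (0x7E) sorting after every ASCII letter and digit, so custom fields (those starting with `#`) end up after the standard ones. A small standalone sketch of that behavior follows. The free function `field_sorter` copies the method's logic, and the field names in the list are illustrative only.

```python
def field_sorter(key):
    """Sort key that places custom fields (prefixed with '#') last."""
    if key.startswith('#'):
        # '~' sorts after all ASCII letters and digits.
        return '~%s' % key[1:]
    return key


fields = ['#genre', 'title', 'authors', '#read', 'tags']
ordered = sorted(fields, key=field_sorter)
print(ordered)
```

The standard fields come out alphabetically, followed by the custom fields in alphabetical order of their names without the `#`.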
 class InterfaceActionBase(Plugin):  # {{{
    supported_platforms = ['windows', 'osx', 'linux']
    author         = 'Kovid Goyal'
    type = _('User interface action')
    can_be_disabled = False
    actual_plugin = None
    def __init__(self, *args, **kwargs):
        Plugin.__init__(self, *args, **kwargs)
        self.actual_plugin_ = None
    def load_actual_plugin(self, gui):
        '''
        This method must return the actual interface action plugin object.
        '''
        ac = self.actual_plugin_
        if ac is None:
            mod, cls = self.actual_plugin.split(':')
            ac = getattr(importlib.import_module(mod), cls)(gui,
                    self.site_customization)
            self.actual_plugin_ = ac
        return ac
 # }}}
 class PreferencesPlugin(Plugin):  # {{{
    '''
    A plugin representing a widget displayed in the Preferences dialog.
    This plugin has only one important method :meth:`create_widget`. The
    various fields of the plugin control how it is categorized in the UI.
    '''
    supported_platforms = ['windows', 'osx', 'linux']
    author         = 'Kovid Goyal'
    type = _('Preferences')
    can_be_disabled = False
    #: Import path to module that contains a class named ConfigWidget
    #: which implements the ConfigWidgetInterface. Used by
    #: :meth:`create_widget`.
    config_widget = None
    #: Where in the list of categories the :attr:`category` of this plugin should be.
    category_order = 100
    #: Where in the list of names in a category, the :attr:`gui_name` of this
    #: plugin should be
    name_order = 100
    #: The category this plugin should be in
    category = None
    #: The category name displayed to the user for this plugin
    gui_category = None
    #: The name displayed to the user for this plugin
    gui_name = None
    #: The icon for this plugin, should be an absolute path
    icon = None
    #: The description used for tooltips and the like
    description = None
    def create_widget(self, parent=None):
        '''
        Create and return the actual Qt widget used for setting this group of
        preferences. The widget must implement the
        :class:`calibre.gui2.preferences.ConfigWidgetInterface`.
        The default implementation uses :attr:`config_widget` to instantiate
        the widget.
        '''
        base, _, wc = self.config_widget.partition(':')
        if not wc:
            wc = 'ConfigWidget'
        base = importlib.import_module(base)
        widget = getattr(base, wc)
        return widget(parent)
 # }}}
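`create_widget` accepts an import path with an optional `:ClassName` suffix and falls back to `ConfigWidget` when the suffix is missing. A minimal sketch of that resolution step, using standard library modules as stand-ins for the widget modules, is given below. The function name `resolve_widget` and its `default_class` parameter are hypothetical.

```python
import importlib


def resolve_widget(spec, default_class='ConfigWidget'):
    """Resolve 'package.module[:ClassName]' to the named class object."""
    module_name, _, class_name = spec.partition(':')
    if not class_name:
        # No explicit class given, so fall back to the conventional name.
        class_name = default_class
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

Because `str.partition` always returns three parts, a spec without a colon yields an empty class name rather than raising, which is what makes the default possible.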
 class StoreBase(Plugin):  # {{{
    supported_platforms = ['windows', 'osx', 'linux']
    author         = 'John Schember'
    type = _('Store')
    # Information about the store. Should be in the primary language
    # of the store. This should not be translatable when set by
    # a subclass.
    description = _('An e-book store.')
    minimum_calibre_version = (0, 8, 0)
    version        = (1, 0, 1)
    actual_plugin = None
    # Does the store only distribute e-books without DRM.
    drm_free_only = False
    # This is the 2 letter country code for the corporate
    # headquarters of the store.
    headquarters = ''
    # All formats the store distributes e-books in.
    formats = []
    # Is this store on an affiliate program?
    affiliate = False
    def load_actual_plugin(self, gui):
        '''
        This method must return the actual store plugin object.
        '''
        mod, cls = self.actual_plugin.split(':')
        self.actual_plugin_object  = getattr(importlib.import_module(mod), cls)(gui, self.name)
        return self.actual_plugin_object
    def customization_help(self, gui=False):
        if getattr(self, 'actual_plugin_object', None) is not None:
            return self.actual_plugin_object.customization_help(gui)
        raise NotImplementedError()
    def config_widget(self):
        if getattr(self, 'actual_plugin_object', None) is not None:
            return self.actual_plugin_object.config_widget()
        raise NotImplementedError()
    def save_settings(self, config_widget):
        if getattr(self, 'actual_plugin_object', None) is not None:
            return self.actual_plugin_object.save_settings(config_widget)
        raise NotImplementedError()
 # }}}
 class EditBookToolPlugin(Plugin):  # {{{
    type = _('Edit book tool')
    minimum_calibre_version = (1, 46, 0)
 # }}}
 class LibraryClosedPlugin(Plugin):  # {{{
    '''
    LibraryClosedPlugins are run when a library is closed, either at shutdown,
    when the library is changed, or when a library is used in some other way.
    At the moment these plugins won't be called by the CLI functions.
    '''
    type = _('Library closed')
    # minimum version 2.54 because that is when support was added
    minimum_calibre_version = (2, 54, 0)
    def run(self, db):
        '''
        The db will be a reference to the new_api (db.cache.py).
        The plugin must run to completion. It must not use the GUI, threads, or
        any signals.
        '''
        raise NotImplementedError('LibraryClosedPlugin '
                'run method must be overridden in subclass')
 # }}}
--- a/ebook_converter/customize/builtins.py
+++ b/ebook_converter/customize/builtins.py
--- a/ebook_converter/customize/conversion.py
+++ b/ebook_converter/customize/conversion.py
@@ -0,0 +1,376 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 '''
 Defines the plugin system for conversions.
 '''
 import re, os, shutil, numbers
 from calibre import CurrentDir
 from calibre.customize import Plugin
 from polyglot.builtins import unicode_type
 class ConversionOption(object):
    '''
    Class representing conversion options
    '''
    def __init__(self, name=None, help=None, long_switch=None,
                 short_switch=None, choices=None):
        self.name = name
        self.help = help
        self.long_switch = long_switch
        self.short_switch = short_switch
        self.choices = choices
        if self.long_switch is None:
            self.long_switch = self.name.replace('_', '-')
        self.validate_parameters()
    def validate_parameters(self):
        '''
        Validate the parameters passed to :meth:`__init__`.
        '''
        if re.match(r'[a-zA-Z_][a-zA-Z0-9_]*$', self.name) is None:
            raise ValueError(self.name + ' is not a valid Python identifier')
        if not self.help:
            raise ValueError('You must set the help text')
    def __hash__(self):
        return hash(self.name)
    def __eq__(self, other):
        return self.name == getattr(other, 'name', other)
    def clone(self):
        return ConversionOption(name=self.name, help=self.help,
                long_switch=self.long_switch, short_switch=self.short_switch,
                choices=self.choices)
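Because `ConversionOption` hashes and compares by `name`, a set of options holds at most one entry per name, and a plain string can be used to test membership. The sketch below shows this with a trimmed-down stand-in class. `Option` is hypothetical and omits the help text and validation of the real class.

```python
class Option:
    """Hypothetical, trimmed-down stand-in for ConversionOption."""

    def __init__(self, name, long_switch=None):
        self.name = name
        self.long_switch = long_switch
        if self.long_switch is None:
            # Default switch: underscores become dashes.
            self.long_switch = self.name.replace('_', '-')

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        # Compare against another option's name, or against a plain string.
        return self.name == getattr(other, 'name', other)


options = {Option('input_encoding'), Option('input_encoding')}
```

Here `options` ends up with a single element, and `'input_encoding' in options` evaluates to true because the string and the option share a hash and compare equal.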
 class OptionRecommendation(object):
    LOW  = 1
    MED  = 2
    HIGH = 3
    def __init__(self, recommended_value=None, level=LOW, **kwargs):
        '''
        An option recommendation. That is, an option as well as its recommended
        value and the level of the recommendation.
        '''
        self.level = level
        self.recommended_value = recommended_value
        self.option = kwargs.pop('option', None)
        if self.option is None:
            self.option = ConversionOption(**kwargs)
        self.validate_parameters()
    @property
    def help(self):
        return self.option.help
    def clone(self):
        return OptionRecommendation(recommended_value=self.recommended_value,
                level=self.level, option=self.option.clone())
    def validate_parameters(self):
        if self.option.choices and self.recommended_value not in \
                                                    self.option.choices:
            raise ValueError('OpRec: %s: Recommended value not in choices'%
                             self.option.name)
        if not (isinstance(self.recommended_value, (numbers.Number, bytes, unicode_type)) or self.recommended_value is None):
            raise ValueError('OpRec: %s:'%self.option.name + repr(
                self.recommended_value) + ' is not a string or a number')
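The two checks in `OptionRecommendation.validate_parameters` can be read as a small standalone predicate: the value must be one of the declared choices (when there are any), and it must be a number, a string, or `None`. The following sketch assumes Python 3, where `str` takes the place of `unicode_type`. The function name `check_recommendation` is hypothetical.

```python
import numbers


def check_recommendation(value, choices=None):
    """Raise ValueError unless ``value`` is an acceptable recommended value."""
    if choices and value not in choices:
        raise ValueError('Recommended value not in choices')
    if not (value is None or isinstance(value, (numbers.Number, bytes, str))):
        raise ValueError(repr(value) + ' is not a string or a number')
```

Note that when choices are declared, `None` is rejected unless it is itself listed as a choice, which matches the order of the checks in the method.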
 class DummyReporter(object):
    def __init__(self):
        self.cancel_requested = False
    def __call__(self, percent, msg=''):
        pass
 def gui_configuration_widget(name, parent, get_option_by_name,
        get_option_help, db, book_id, for_output=True):
    import importlib
    def widget_factory(cls):
        return cls(parent, get_option_by_name,
            get_option_help, db, book_id)
    if for_output:
        try:
            output_widget = importlib.import_module(
                    'calibre.gui2.convert.'+name)
            pw = output_widget.PluginWidget
            pw.ICON = I('back.png')
            pw.HELP = _('Options specific to the output format.')
            return widget_factory(pw)
        except ImportError:
            pass
    else:
        try:
            input_widget = importlib.import_module(
                    'calibre.gui2.convert.'+name)
            pw = input_widget.PluginWidget
            pw.ICON = I('forward.png')
            pw.HELP = _('Options specific to the input format.')
            return widget_factory(pw)
        except ImportError:
            pass
    return None
 class InputFormatPlugin(Plugin):
    '''
    InputFormatPlugins are responsible for converting a document into
    HTML+OPF+CSS+etc.
    The results of the conversion *must* be encoded in UTF-8.
    The main action happens in :meth:`convert`.
    '''
    type = _('Conversion input')
    can_be_disabled = False
    supported_platforms = ['windows', 'osx', 'linux']
    commit_name = None  # unique name under which options for this plugin are saved
    ui_data = None
    #: Set of file types for which this plugin should be run
    #: For example: ``set(['azw', 'mobi', 'prc'])``
    file_types     = set()
    #: If True, this input plugin generates a collection of images,
    #: one per HTML file. This can be set dynamically, in the convert method
    #: if the input files can be both image collections and non-image collections.
    #: If you set this to True, you must implement the get_images() method that returns
    #: a list of images.
    is_image_collection = False
    #: Number of CPU cores used by this plugin.
    #: A value of -1 means that it uses all available cores
    core_usage = 1
    #: If set to True, the input plugin will perform special processing
    #: to make its output suitable for viewing
    for_viewer = False
    #: The encoding that this input plugin creates files in. A value of
    #: None means that the encoding is undefined and must be
    #: detected individually
    output_encoding = 'utf-8'
    #: Options shared by all Input format plugins. Do not override
    #: in sub-classes. Use :attr:`options` instead. Every option must be an
    #: instance of :class:`OptionRecommendation`.
    common_options = {
        OptionRecommendation(name='input_encoding',
            recommended_value=None, level=OptionRecommendation.LOW,
            help=_('Specify the character encoding of the input document. If '
                   'set this option will override any encoding declared by the '
                   'document itself. Particularly useful for documents that '
                   'do not declare an encoding or that have erroneous '
                   'encoding declarations.')
        )}
    #: Options to customize the behavior of this plugin. Every option must be an
    #: instance of :class:`OptionRecommendation`.
    options = set()
    #: A set of 3-tuples of the form
    #: (option_name, recommended_value, recommendation_level)
    recommendations = set()
    def __init__(self, *args):
        Plugin.__init__(self, *args)
        self.report_progress = DummyReporter()
    def get_images(self):
        '''
        Return a list of absolute paths to the images, if this input plugin
        represents an image collection. The list of images is in the same order
        as the spine and the TOC.
        '''
        raise NotImplementedError()
    def convert(self, stream, options, file_ext, log, accelerators):
        '''
        This method must be implemented in sub-classes. It must return
        the path to the created OPF file or an :class:`OEBBook` instance.
        All output should be contained in the current directory.
        If this plugin creates files outside the current
        directory they must be deleted/marked for deletion before this method
        returns.
        :param stream:   A file like object that contains the input file.
        :param options:  Options to customize the conversion process.
                         Guaranteed to have attributes corresponding
                         to all the options declared by this plugin. In
                         addition, it will have a verbose attribute that
                         takes integral values from zero upwards. Higher numbers
                         mean more verbose output. Another useful attribute is
                         ``input_profile`` that is an instance of
                         :class:`calibre.customize.profiles.InputProfile`.
        :param file_ext: The extension (without the .) of the input file. It
                         is guaranteed to be one of the `file_types` supported
                         by this plugin.
        :param log: A :class:`calibre.utils.logging.Log` object. All output
                    should use this object.
        :param accelerators: A dictionary of various information that the input
                             plugin can get easily that would speed up the
                             subsequent stages of the conversion.
        '''
        raise NotImplementedError()
    def __call__(self, stream, options, file_ext, log,
                 accelerators, output_dir):
        try:
            log('InputFormatPlugin: %s running'%self.name)
            if hasattr(stream, 'name'):
                log('on', stream.name)
        except Exception:
            # In case stdout is broken
            pass
        with CurrentDir(output_dir):
            # Empty the output directory before running the conversion
            for x in os.listdir('.'):
                if os.path.isdir(x):
                    shutil.rmtree(x)
                else:
                    os.remove(x)
            ret = self.convert(stream, options, file_ext,
                               log, accelerators)
        return ret
    def postprocess_book(self, oeb, opts, log):
        '''
        Called to allow the input plugin to perform postprocessing after
        the book has been parsed.
        '''
        pass
    def specialize(self, oeb, opts, log, output_fmt):
        '''
        Called to allow the input plugin to specialize the parsed book
        for a particular output format. Called after postprocess_book
        and before any transforms are performed on the parsed book.
        '''
        pass
    def gui_configuration_widget(self, parent, get_option_by_name,
            get_option_help, db, book_id=None):
        '''
        Called to create the widget used for configuring this plugin in the
        calibre GUI. The widget must be an instance of the PluginWidget class.
        See the builtin input plugins for examples.
        '''
        name = self.name.lower().replace(' ', '_')
        return gui_configuration_widget(name, parent, get_option_by_name,
                get_option_help, db, book_id, for_output=False)
 class OutputFormatPlugin(Plugin):
    '''
    OutputFormatPlugins are responsible for converting an OEB document
    (OPF+HTML) into an output e-book.
    The OEB document can be assumed to be encoded in UTF-8.
    The main action happens in :meth:`convert`.
    '''
    type = _('Conversion output')
    can_be_disabled = False
    supported_platforms = ['windows', 'osx', 'linux']
    commit_name = None  # unique name under which options for this plugin are saved
    ui_data = None
    #: The file type (extension without leading period) that this
    #: plugin outputs
    file_type     = None
    #: Options shared by all Output format plugins. Do not override
    #: in sub-classes. Use :attr:`options` instead. Every option must be an
    #: instance of :class:`OptionRecommendation`.
    common_options = {
        OptionRecommendation(name='pretty_print',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('If specified, the output plugin will try to create output '
            'that is as human readable as possible. May not have any effect '
            'for some output plugins.')
        )}
    #: Options to customize the behavior of this plugin. Every option must be an
    #: instance of :class:`OptionRecommendation`.
    options = set()
    #: A set of 3-tuples of the form
    #: (option_name, recommended_value, recommendation_level)
    recommendations = set()
    @property
    def description(self):
        return _('Convert e-books to the %s format')%self.file_type
    def __init__(self, *args):
        Plugin.__init__(self, *args)
        self.report_progress = DummyReporter()
    def convert(self, oeb_book, output, input_plugin, opts, log):
        '''
        Render the contents of `oeb_book` (which is an instance of
        :class:`calibre.ebooks.oeb.OEBBook`) to the file specified by output.
        :param output: Either a file like object or a string. If it is a string
                       it is the path to a directory that may or may not exist. The output
                       plugin should write its output into that directory. If it is a file like
                       object, the output plugin should write its output into the file.
        :param input_plugin: The input plugin that was used at the beginning of
                             the conversion pipeline.
        :param opts: Conversion options. Guaranteed to have attributes
                     corresponding to the OptionRecommendations of this plugin.
        :param log: The logger. Print debug/info messages etc. using this.
        '''
        raise NotImplementedError()
    @property
    def is_periodical(self):
        return self.oeb.metadata.publication_type and \
            unicode_type(self.oeb.metadata.publication_type[0]).startswith('periodical:')
    def specialize_options(self, log, opts, input_fmt):
        '''
        Can be used to change the values of conversion options, as used by the
        conversion pipeline.
        '''
        pass
    def specialize_css_for_output(self, log, opts, item, stylizer):
        '''
        Can be used to make changes to the css during the CSS flattening
        process.
        :param item: The item (HTML file) being processed
        :param stylizer: A Stylizer object containing the flattened styles for
                         item. You can get the style for any element by
                         stylizer.style(element).
        '''
        pass
    def gui_configuration_widget(self, parent, get_option_by_name,
            get_option_help, db, book_id=None):
        '''
        Called to create the widget used for configuring this plugin in the
        calibre GUI. The widget must be an instance of the PluginWidget class.
        See the builtin output plugins for examples.
        '''
        name = self.name.lower().replace(' ', '_')
        return gui_configuration_widget(name, parent, get_option_by_name,
                get_option_help, db, book_id, for_output=True)
--- a/ebook_converter/customize/profiles.py
+++ b/ebook_converter/customize/profiles.py
@@ -0,0 +1,873 @@
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 from calibre.customize import Plugin as _Plugin
 from polyglot.builtins import zip
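 # Pairs of (CSS font-size keyword, numeric font size). None marks an entry
 # that has no counterpart in the other scheme.
 FONT_SIZES = [('xx-small', 1),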
              ('x-small',  None),
              ('small',    2),
              ('medium',   3),
              ('large',    4),
              ('x-large',  5),
              ('xx-large', 6),
              (None,       7)]
 class Plugin(_Plugin):
    fbase  = 12
    fsizes = [5, 7, 9, 12, 13.5, 17, 20, 22, 24]
    screen_size = (1600, 1200)
    dpi = 100
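    def __init__(self, *args, **kwargs):
        _Plugin.__init__(self, *args, **kwargs)
        self.width, self.height = self.screen_size
        # Keep the raw size list in fkey; fsizes is rebuilt below as a list
        # of (name, num, size) tuples, with size stored as a float
        fsizes = list(self.fsizes)
        self.fkey = list(self.fsizes)
        self.fsizes = []
        for (name, num), size in zip(FONT_SIZES, fsizes):
            self.fsizes.append((name, num, float(size)))
        # Lookup tables from keyword name and from numeric size to font size
        self.fnames = dict((name, sz) for name, _, sz in self.fsizes if name)
        self.fnums = dict((num, sz) for _, num, sz in self.fsizes if num)
        # Screen dimensions converted from pixels to points (72 points per inch)
        self.width_pts = self.width * 72./self.dpi
        self.height_pts = self.height * 72./self.dpi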
 # Input profiles {{{
 class InputProfile(Plugin):
    author = 'Kovid Goyal'
    supported_platforms = {'windows', 'osx', 'linux'}
    can_be_disabled = False
    type = _('Input profile')
    name        = 'Default Input Profile'
    short_name  = 'default'  # Used in the CLI, so don't use spaces etc. in it
    description = _('This profile tries to provide sane defaults and is useful '
                    'if you know nothing about the input document.')
 class SonyReaderInput(InputProfile):
    name        = 'Sony Reader'
    short_name  = 'sony'
    description = _('This profile is intended for the SONY PRS line. '
                    'The 500/505/600/700 etc.')
    screen_size               = (584, 754)
    dpi                       = 168.451
    fbase                     = 12
    fsizes                    = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
 class SonyReader300Input(SonyReaderInput):
    name        = 'Sony Reader 300'
    short_name  = 'sony300'
    description = _('This profile is intended for the SONY PRS 300.')
    dpi                       = 200
 class SonyReader900Input(SonyReaderInput):
    author      = 'John Schember'
    name        = 'Sony Reader 900'
    short_name  = 'sony900'
    description = _('This profile is intended for the SONY PRS-900.')
    screen_size               = (584, 978)
 class MSReaderInput(InputProfile):
    name        = 'Microsoft Reader'
    short_name  = 'msreader'
    description = _('This profile is intended for the Microsoft Reader.')
    screen_size               = (480, 652)
    dpi                       = 96
    fbase                     = 13
    fsizes                    = [10, 11, 13, 16, 18, 20, 22, 26]
 class MobipocketInput(InputProfile):
    name        = 'Mobipocket Books'
    short_name  = 'mobipocket'
    description = _('This profile is intended for the Mobipocket books.')
    # Unfortunately MOBI books are not narrowly targeted, so this information is
    # quite likely to be spurious
    screen_size               = (600, 800)
    dpi                       = 96
    fbase                     = 18
    fsizes                    = [14, 14, 16, 18, 20, 22, 24, 26]
 class HanlinV3Input(InputProfile):
    name        = 'Hanlin V3'
    short_name  = 'hanlinv3'
    description = _('This profile is intended for the Hanlin V3 and its clones.')
    # Screen size is a best guess
    screen_size               = (584, 754)
    dpi                       = 168.451
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
 class HanlinV5Input(HanlinV3Input):
    name        = 'Hanlin V5'
    short_name  = 'hanlinv5'
    description = _('This profile is intended for the Hanlin V5 and its clones.')
    # Screen size is a best guess
    screen_size               = (584, 754)
    dpi                       = 200
 class CybookG3Input(InputProfile):
    name        = 'Cybook G3'
    short_name  = 'cybookg3'
    description = _('This profile is intended for the Cybook G3.')
    # Screen size is a best guess
    screen_size               = (600, 800)
    dpi                       = 168.451
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
 class CybookOpusInput(InputProfile):
    author      = 'John Schember'
    name        = 'Cybook Opus'
    short_name  = 'cybook_opus'
    description = _('This profile is intended for the Cybook Opus.')
    # Screen size is a best guess
    screen_size               = (600, 800)
    dpi                       = 200
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
 class KindleInput(InputProfile):
    name        = 'Kindle'
    short_name  = 'kindle'
    description = _('This profile is intended for the Amazon Kindle.')
    # Screen size is a best guess
    screen_size               = (525, 640)
    dpi                       = 168.451
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
 class IlliadInput(InputProfile):
    name        = 'Illiad'
    short_name  = 'illiad'
    description = _('This profile is intended for the Irex Illiad.')
    screen_size               = (760, 925)
    dpi                       = 160.0
    fbase                     = 12
    fsizes                    = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
 class IRexDR1000Input(InputProfile):
    author      = 'John Schember'
    name        = 'IRex Digital Reader 1000'
    short_name  = 'irexdr1000'
    description = _('This profile is intended for the IRex Digital Reader 1000.')
    # Screen size is a best guess
    screen_size               = (1024, 1280)
    dpi                       = 160
    fbase                     = 16
    fsizes                    = [12, 14, 16, 18, 20, 22, 24]
 class IRexDR800Input(InputProfile):
    author      = 'Eric Cronin'
    name        = 'IRex Digital Reader 800'
    short_name  = 'irexdr800'
    description = _('This profile is intended for the IRex Digital Reader 800.')
    screen_size               = (768, 1024)
    dpi                       = 160
    fbase                     = 16
    fsizes                    = [12, 14, 16, 18, 20, 22, 24]
 class NookInput(InputProfile):
    author      = 'John Schember'
    name        = 'Nook'
    short_name  = 'nook'
    description = _('This profile is intended for the B&N Nook.')
    # Screen size is a best guess
    screen_size               = (600, 800)
    dpi                       = 167
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
 input_profiles = [InputProfile, SonyReaderInput, SonyReader300Input,
        SonyReader900Input, MSReaderInput, MobipocketInput, HanlinV3Input,
        HanlinV5Input, CybookG3Input, CybookOpusInput, KindleInput, IlliadInput,
        IRexDR1000Input, IRexDR800Input, NookInput]
 input_profiles.sort(key=lambda x: x.name.lower())
 # }}}
 class OutputProfile(Plugin):
    author = 'Kovid Goyal'
    supported_platforms = {'windows', 'osx', 'linux'}
    can_be_disabled = False
    type = _('Output profile')
    name        = 'Default Output Profile'
    short_name  = 'default'  # Used in the CLI, so don't use spaces etc. in it
    description = _('This profile tries to provide sane defaults and is useful '
                    'if you want to produce a document intended to be read at a '
                    'computer or on a range of devices.')
    #: The image size for comics
    comic_screen_size = (584, 754)
    #: If True the MOBI renderer on the device supports MOBI indexing
    supports_mobi_indexing = False
    #: If True output should be optimized for a touchscreen interface
    touchscreen = False
    touchscreen_news_css = ''
    #: A list of extra (beyond CSS 2.1) modules supported by the device
    #: Format is a css_parser profile dictionary (see iPad for example)
    extra_css_modules = []
    #: If True, the date is appended to the title of downloaded news
    periodical_date_in_title = True
    #: Characters used in jackets and catalogs
    ratings_char = '*'
    empty_ratings_char = ' '
    #: Unsupported unicode characters to be replaced during preprocessing
    unsupported_unicode_chars = []
    #: Number of ems that the left margin of a blockquote is rendered as
    mobi_ems_per_blockquote = 1.0
    #: Special periodical formatting needed in EPUB
    epub_periodical_format = None
 class iPadOutput(OutputProfile):
    name = 'iPad'
    short_name = 'ipad'
    description = _('Intended for the iPad and similar devices with a '
            'resolution of 768x1024')
    screen_size = (768, 1024)
    comic_screen_size = (768, 1024)
    dpi = 132.0
    extra_css_modules = [
        {
            'name':'webkit',
            'props': {'-webkit-border-bottom-left-radius':'{length}',
                '-webkit-border-bottom-right-radius':'{length}',
                '-webkit-border-top-left-radius':'{length}',
                '-webkit-border-top-right-radius':'{length}',
                '-webkit-border-radius': r'{border-width}(\s+{border-width}){0,3}|inherit',
            },
            'macros': {'border-width': '{length}|medium|thick|thin'}
        }
    ]
    ratings_char = '\u2605'            # filled star
    empty_ratings_char = '\u2606'      # hollow star
    touchscreen = True
    # touchscreen_news_css {{{
    touchscreen_news_css = '''
            /* hr used in articles */
            .article_articles_list {
                width:18%;
                }
            .article_link {
                color: #593f29;
                font-style: italic;
                }
            .article_next {
                -webkit-border-top-right-radius:4px;
                -webkit-border-bottom-right-radius:4px;
                font-style: italic;
                width:32%;
                }
            .article_prev {
                -webkit-border-top-left-radius:4px;
                -webkit-border-bottom-left-radius:4px;
                font-style: italic;
                width:32%;
                }
            .article_sections_list {
                width:18%;
                }
            .articles_link {
                font-weight: bold;
                }
            .sections_link {
                font-weight: bold;
                }
            .caption_divider {
                border:#ccc 1px solid;
                }
            .touchscreen_navbar {
                background:#c3bab2;
                border:#ccc 0px solid;
                border-collapse:separate;
                border-spacing:1px;
                margin-left: 5%;
                margin-right: 5%;
                page-break-inside:avoid;
                width: 90%;
                -webkit-border-radius:4px;
                }
            .touchscreen_navbar td {
                background:#fff;
                font-family:Helvetica;
                font-size:80%;
                /* UI touchboxes use 8px padding */
                padding: 6px;
                text-align:center;
                }
            .touchscreen_navbar td a:link {
                color: #593f29;
                text-decoration: none;
                }
            /* Index formatting */
            .publish_date {
                text-align:center;
                }
            .divider {
                border-bottom:1em solid white;
                border-top:1px solid gray;
                }
            hr.caption_divider {
                border-color:black;
                border-style:solid;
                border-width:1px;
                }
            /* Feed summary formatting */
            .article_summary {
                display:inline-block;
                padding-bottom:0.5em;
                }
            .feed {
                font-family:sans-serif;
                font-weight:bold;
                font-size:larger;
                }
            .feed_link {
                font-style: italic;
                }
            .feed_next {
                -webkit-border-top-right-radius:4px;
                -webkit-border-bottom-right-radius:4px;
                font-style: italic;
                width:40%;
                }
            .feed_prev {
                -webkit-border-top-left-radius:4px;
                -webkit-border-bottom-left-radius:4px;
                font-style: italic;
                width:40%;
                }
            .feed_title {
                text-align: center;
                font-size: 160%;
                }
            .feed_up {
                font-weight: bold;
                width:20%;
                }
            .summary_headline {
                font-weight:bold;
                text-align:left;
                }
            .summary_byline {
                text-align:left;
                font-family:monospace;
                }
            .summary_text {
                text-align:left;
                }
        '''
    # }}}
 class iPad3Output(iPadOutput):
    screen_size = comic_screen_size = (2048, 1536)
    dpi = 264.0
    name = 'iPad 3'
    short_name = 'ipad3'
    description = _('Intended for the iPad 3 and similar devices with a '
            'resolution of 1536x2048')
 class TabletOutput(iPadOutput):
    name = 'Tablet'
    short_name = 'tablet'
    description = _('Intended for generic tablet devices, does no resizing of images')
    screen_size = (10000, 10000)
    comic_screen_size = (10000, 10000)
 class SamsungGalaxy(TabletOutput):
    name = 'Samsung Galaxy'
    short_name = 'galaxy'
    description = _('Intended for the Samsung Galaxy and similar tablet devices with '
            'a resolution of 600x1280')
    screen_size = comic_screen_size = (600, 1280)
 class NookHD(TabletOutput):
    name = 'Nook HD+'
    short_name = 'nook_hd_plus'
    description = _('Intended for the Nook HD+ and similar tablet devices with '
            'a resolution of 1280x1920')
    screen_size = comic_screen_size = (1280, 1920)
 class SonyReaderOutput(OutputProfile):
    name        = 'Sony Reader'
    short_name  = 'sony'
    description = _('This profile is intended for the SONY PRS line. '
                    'The 500/505/600/700 etc.')
    screen_size               = (590, 775)
    dpi                       = 168.451
    fbase                     = 12
    fsizes                    = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
    unsupported_unicode_chars = [u'\u201f', u'\u201b']
    epub_periodical_format = 'sony'
    # periodical_date_in_title = False
 class KoboReaderOutput(OutputProfile):
    name = 'Kobo Reader'
    short_name = 'kobo'
    description = _('This profile is intended for the Kobo Reader.')
    screen_size               = (536, 710)
    comic_screen_size         = (536, 710)
    dpi                       = 168.451
    fbase                     = 12
    fsizes                    = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
 class SonyReader300Output(SonyReaderOutput):
    author      = 'John Schember'
    name        = 'Sony Reader 300'
    short_name  = 'sony300'
    description = _('This profile is intended for the SONY PRS-300.')
    dpi                       = 200
 class SonyReader900Output(SonyReaderOutput):
    author      = 'John Schember'
    name        = 'Sony Reader 900'
    short_name  = 'sony900'
    description = _('This profile is intended for the SONY PRS-900.')
    screen_size               = (600, 999)
    comic_screen_size = screen_size
 class SonyReaderT3Output(SonyReaderOutput):
    author = 'Kovid Goyal'
    name        = 'Sony Reader T3'
    short_name  = 'sonyt3'
    description = _('This profile is intended for the SONY PRS-T3.')
    screen_size               = (758, 934)
    comic_screen_size = screen_size
 class GenericEink(SonyReaderOutput):
    name = 'Generic e-ink'
    short_name = 'generic_eink'
    description = _('Suitable for use with any e-ink device')
    epub_periodical_format = None
 class GenericEinkLarge(GenericEink):
    name = 'Generic e-ink large'
    short_name = 'generic_eink_large'
    description = _('Suitable for use with any large screen e-ink device')
    screen_size               = (600, 999)
    comic_screen_size = screen_size
 class GenericEinkHD(GenericEink):
    name = 'Generic e-ink HD'
    short_name = 'generic_eink_hd'
    description = _('Suitable for use with any modern high resolution e-ink device')
    screen_size = (10000, 10000)
    comic_screen_size = (10000, 10000)
 class JetBook5Output(OutputProfile):
    name        = 'JetBook 5-inch'
    short_name  = 'jetbook5'
    description = _('This profile is intended for the 5-inch JetBook.')
    screen_size               = (480, 640)
    dpi                       = 168.451
 class SonyReaderLandscapeOutput(SonyReaderOutput):
    name        = 'Sony Reader Landscape'
    short_name  = 'sony-landscape'
    description = _('This profile is intended for the SONY PRS line. '
                    'The 500/505/700 etc, in landscape mode. Mainly useful '
                    'for comics.')
    screen_size               = (784, 1012)
    comic_screen_size         = (784, 1012)
 class MSReaderOutput(OutputProfile):
    name        = 'Microsoft Reader'
    short_name  = 'msreader'
    description = _('This profile is intended for the Microsoft Reader.')
    screen_size               = (480, 652)
    dpi                       = 96
    fbase                     = 13
    fsizes                    = [10, 11, 13, 16, 18, 20, 22, 26]
 class MobipocketOutput(OutputProfile):
    name        = 'Mobipocket Books'
    short_name  = 'mobipocket'
    description = _('This profile is intended for the Mobipocket books.')
    # Unfortunately MOBI books are not narrowly targeted, so this information is
    # quite likely to be spurious
    screen_size               = (600, 800)
    dpi                       = 96
    fbase                     = 18
    fsizes                    = [14, 14, 16, 18, 20, 22, 24, 26]
 class HanlinV3Output(OutputProfile):
    name        = 'Hanlin V3'
    short_name  = 'hanlinv3'
    description = _('This profile is intended for the Hanlin V3 and its clones.')
    # Screen size is a best guess
    screen_size               = (584, 754)
    dpi                       = 168.451
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
 class HanlinV5Output(HanlinV3Output):
    name        = 'Hanlin V5'
    short_name  = 'hanlinv5'
    description = _('This profile is intended for the Hanlin V5 and its clones.')
    dpi                       = 200
 class CybookG3Output(OutputProfile):
    name        = 'Cybook G3'
    short_name  = 'cybookg3'
    description = _('This profile is intended for the Cybook G3.')
    # Screen size is a best guess
    screen_size               = (600, 800)
    comic_screen_size         = (600, 757)
    dpi                       = 168.451
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
 class CybookOpusOutput(SonyReaderOutput):
    author      = 'John Schember'
    name        = 'Cybook Opus'
    short_name  = 'cybook_opus'
    description = _('This profile is intended for the Cybook Opus.')
    # Screen size is inherited from SonyReaderOutput
    dpi                       = 200
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
    epub_periodical_format = None
 class KindleOutput(OutputProfile):
    name        = 'Kindle'
    short_name  = 'kindle'
    description = _('This profile is intended for the Amazon Kindle.')
    # Screen size is a best guess
    screen_size               = (525, 640)
    dpi                       = 168.451
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
    supports_mobi_indexing = True
    periodical_date_in_title = False
    empty_ratings_char = '\u2606'
    ratings_char = '\u2605'
    mobi_ems_per_blockquote = 2.0
 class KindleDXOutput(OutputProfile):
    name        = 'Kindle DX'
    short_name  = 'kindle_dx'
    description = _('This profile is intended for the Amazon Kindle DX.')
    # Screen size is a best guess
    screen_size               = (744, 1022)
    dpi                       = 150.0
    comic_screen_size = (771, 1116)
    # comic_screen_size         = (741, 1022)
    supports_mobi_indexing = True
    periodical_date_in_title = False
    empty_ratings_char = '\u2606'
    ratings_char = '\u2605'
    mobi_ems_per_blockquote = 2.0
 class KindlePaperWhiteOutput(KindleOutput):
    name = 'Kindle PaperWhite'
    short_name = 'kindle_pw'
    description = _('This profile is intended for the Amazon Kindle PaperWhite 1 and 2')
    # Screen size is a best guess
    screen_size               = (658, 940)
    dpi                       = 212.0
    comic_screen_size = screen_size
 class KindleVoyageOutput(KindleOutput):
    name = 'Kindle Voyage'
    short_name = 'kindle_voyage'
    description = _('This profile is intended for the Amazon Kindle Voyage')
    # Screen size is currently just the spec size, actual renderable area will
    # depend on someone with the device doing tests.
    screen_size               = (1080, 1430)
    dpi                       = 300.0
    comic_screen_size = screen_size
 class KindlePaperWhite3Output(KindleVoyageOutput):
    name = 'Kindle PaperWhite 3'
    short_name = 'kindle_pw3'
    description = _('This profile is intended for the Amazon Kindle PaperWhite 3 and above')
    # Screen size is currently just the spec size, actual renderable area will
    # depend on someone with the device doing tests.
    screen_size               = (1072, 1430)
    dpi                       = 300.0
    comic_screen_size = screen_size
 class KindleOasisOutput(KindlePaperWhite3Output):
    name = 'Kindle Oasis'
    short_name = 'kindle_oasis'
    description = _('This profile is intended for the Amazon Kindle Oasis 2017 and above')
    # Screen size is currently just the spec size, actual renderable area will
    # depend on someone with the device doing tests.
    screen_size               = (1264, 1680)
    dpi                       = 300.0
    comic_screen_size = screen_size
 class KindleFireOutput(KindleDXOutput):
    name = 'Kindle Fire'
    short_name = 'kindle_fire'
    description = _('This profile is intended for the Amazon Kindle Fire.')
    screen_size               = (570, 1016)
    dpi                       = 169.0
    comic_screen_size = (570, 1016)
 class IlliadOutput(OutputProfile):
    name        = 'Illiad'
    short_name  = 'illiad'
    description = _('This profile is intended for the Irex Illiad.')
    screen_size               = (760, 925)
    comic_screen_size         = (760, 925)
    dpi                       = 160.0
    fbase                     = 12
    fsizes                    = [7.5, 9, 10, 12, 15.5, 20, 22, 24]
 class IRexDR1000Output(OutputProfile):
    author      = 'John Schember'
    name        = 'IRex Digital Reader 1000'
    short_name  = 'irexdr1000'
    description = _('This profile is intended for the IRex Digital Reader 1000.')
    # Screen size is a best guess
    screen_size               = (1024, 1280)
    comic_screen_size         = (996, 1241)
    dpi                       = 160
    fbase                     = 16
    fsizes                    = [12, 14, 16, 18, 20, 22, 24]
 class IRexDR800Output(OutputProfile):
    author      = 'Eric Cronin'
    name        = 'IRex Digital Reader 800'
    short_name  = 'irexdr800'
    description = _('This profile is intended for the IRex Digital Reader 800.')
    # Screen size is a best guess
    screen_size               = (768, 1024)
    comic_screen_size         = (768, 1024)
    dpi                       = 160
    fbase                     = 16
    fsizes                    = [12, 14, 16, 18, 20, 22, 24]
 class NookOutput(OutputProfile):
    author      = 'John Schember'
    name        = 'Nook'
    short_name  = 'nook'
    description = _('This profile is intended for the B&N Nook.')
    # Screen size is a best guess
    screen_size               = (600, 730)
    comic_screen_size         = (584, 730)
    dpi                       = 167
    fbase                     = 16
    fsizes                    = [12, 12, 14, 16, 18, 20, 22, 24]
 class NookColorOutput(NookOutput):
    name = 'Nook Color'
    short_name = 'nook_color'
    description = _('This profile is intended for the B&N Nook Color.')
    screen_size               = (600, 900)
    comic_screen_size         = (594, 900)
    dpi                       = 169
 class PocketBook900Output(OutputProfile):
    author = 'Chris Lockfort'
    name = 'PocketBook Pro 900'
    short_name = 'pocketbook_900'
    description = _('This profile is intended for the PocketBook Pro 900 series of devices.')
    screen_size               = (810, 1180)
    dpi                       = 150.0
    comic_screen_size         = screen_size
 class PocketBookPro912Output(OutputProfile):
    author = 'Daniele Pizzolli'
    name = 'PocketBook Pro 912'
    short_name = 'pocketbook_pro_912'
    description = _('This profile is intended for the PocketBook Pro 912 series of devices.')
    # According to http://download.pocketbook-int.com/user-guides/E_Ink/912/User_Guide_PocketBook_912(EN).pdf
    screen_size               = (825, 1200)
    dpi                       = 155.0
    comic_screen_size         = screen_size
 output_profiles = [
    OutputProfile, SonyReaderOutput, SonyReader300Output, SonyReader900Output,
    SonyReaderT3Output, MSReaderOutput, MobipocketOutput, HanlinV3Output,
    HanlinV5Output, CybookG3Output, CybookOpusOutput, KindleOutput, iPadOutput,
    iPad3Output, KoboReaderOutput, TabletOutput, SamsungGalaxy,
    SonyReaderLandscapeOutput, KindleDXOutput, IlliadOutput, NookHD,
    IRexDR1000Output, IRexDR800Output, JetBook5Output, NookOutput,
    NookColorOutput, PocketBook900Output,
    PocketBookPro912Output, GenericEink, GenericEinkLarge, GenericEinkHD,
    KindleFireOutput, KindlePaperWhiteOutput, KindleVoyageOutput,
    KindlePaperWhite3Output, KindleOasisOutput
 ]
 output_profiles.sort(key=lambda x: x.name.lower())
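Note on the final sort above: the profile classes are ordered by their display name, lower-cased so that names beginning with a lowercase letter are not pushed to the end. A minimal sketch of the difference follows. The four classes are hypothetical stand-ins, not the real profile definitions, and only their `name` attribute is used.

```python
# Hypothetical stand-in classes; only the `name` attribute matters here.
class OutputProfile:
    name = 'Default Output Profile'


class NookOutput(OutputProfile):
    name = 'Nook'


class iPadOutput(OutputProfile):
    name = 'iPad'


class IlliadOutput(OutputProfile):
    name = 'Illiad'


profiles = [OutputProfile, NookOutput, iPadOutput, IlliadOutput]

# A plain sort compares code points, so 'iPad' lands after every
# capitalised name.
by_raw_name = [p.name for p in sorted(profiles, key=lambda p: p.name)]

# Lower-casing the key gives a case-insensitive order, which is what
# the sort() call on the profile list does.
by_folded_name = [p.name for p in sorted(profiles, key=lambda p: p.name.lower())]

print(by_raw_name)
print(by_folded_name)
```

The key function only affects comparison; the stored names keep their original capitalisation.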
--- a/ebook_converter/customize/ui.py
+++ b/ebook_converter/customize/ui.py
@@ -0,0 +1,835 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 import os, shutil, traceback, functools, sys
 from collections import defaultdict
 from itertools import chain
 from calibre.customize import (CatalogPlugin, FileTypePlugin, PluginNotFound,
                              MetadataReaderPlugin, MetadataWriterPlugin,
                              InterfaceActionBase as InterfaceAction,
                              PreferencesPlugin, platform, InvalidPlugin,
                              StoreBase as Store, EditBookToolPlugin,
                              LibraryClosedPlugin)
 from calibre.customize.conversion import InputFormatPlugin, OutputFormatPlugin
 from calibre.customize.zipplugin import loader
 from calibre.customize.profiles import InputProfile, OutputProfile
 from calibre.customize.builtins import plugins as builtin_plugins
 from calibre.devices.interface import DevicePlugin
 from calibre.ebooks.metadata import MetaInformation
 from calibre.utils.config import (make_config_dir, Config, ConfigProxy,
                                 plugin_dir, OptionParser)
 from calibre.ebooks.metadata.sources.base import Source
 from calibre.constants import DEBUG, numeric_version
 from polyglot.builtins import iteritems, itervalues, unicode_type
 builtin_names = frozenset(p.name for p in builtin_plugins)
 BLACKLISTED_PLUGINS = frozenset({'Marvin XD', 'iOS reader applications'})
 class NameConflict(ValueError):
    pass
 def _config():
    c = Config('customize')
    c.add_opt('plugins', default={}, help=_('Installed plugins'))
    c.add_opt('filetype_mapping', default={}, help=_('Mapping for filetype plugins'))
    c.add_opt('plugin_customization', default={}, help=_('Local plugin customization'))
    c.add_opt('disabled_plugins', default=set(), help=_('Disabled plugins'))
    c.add_opt('enabled_plugins', default=set(), help=_('Enabled plugins'))
    return ConfigProxy(c)
 config = _config()
 def find_plugin(name):
    for plugin in _initialized_plugins:
        if plugin.name == name:
            return plugin
 def load_plugin(path_to_zip_file):  # {{{
    '''
    Load plugin from ZIP file or raise InvalidPlugin error
    :return: A :class:`Plugin` instance.
    '''
    return loader.load(path_to_zip_file)
 # }}}
 # Enable/disable plugins {{{
 def disable_plugin(plugin_or_name):
    x = getattr(plugin_or_name, 'name', plugin_or_name)
    plugin = find_plugin(x)
    if plugin is None:
        raise ValueError('No plugin named %s found'%x)
    if not plugin.can_be_disabled:
        raise ValueError('Plugin %s cannot be disabled'%x)
    dp = config['disabled_plugins']
    dp.add(x)
    config['disabled_plugins'] = dp
    ep = config['enabled_plugins']
    if x in ep:
        ep.remove(x)
    config['enabled_plugins'] = ep
 def enable_plugin(plugin_or_name):
    x = getattr(plugin_or_name, 'name', plugin_or_name)
    dp = config['disabled_plugins']
    if x in dp:
        dp.remove(x)
    config['disabled_plugins'] = dp
    ep = config['enabled_plugins']
    ep.add(x)
    config['enabled_plugins'] = ep
 def restore_plugin_state_to_default(plugin_or_name):
    x = getattr(plugin_or_name, 'name', plugin_or_name)
    dp = config['disabled_plugins']
    if x in dp:
        dp.remove(x)
    config['disabled_plugins'] = dp
    ep = config['enabled_plugins']
    if x in ep:
        ep.remove(x)
    config['enabled_plugins'] = ep
 default_disabled_plugins = {
    'Overdrive', 'Douban Books', 'OZON.ru', 'Edelweiss', 'Google Images', 'Big Book Search',
 }
 def is_disabled(plugin):
    if plugin.name in config['enabled_plugins']:
        return False
    return plugin.name in config['disabled_plugins'] or \
            plugin.name in default_disabled_plugins
 # }}}
 # File type plugins {{{
 _on_import           = {}
 _on_postimport       = {}
 _on_preprocess       = {}
 _on_postprocess      = {}
 _on_postadd          = []
 def reread_filetype_plugins():
    global _on_import, _on_postimport, _on_preprocess, _on_postprocess, _on_postadd
    _on_import           = defaultdict(list)
    _on_postimport       = defaultdict(list)
    _on_preprocess       = defaultdict(list)
    _on_postprocess      = defaultdict(list)
    _on_postadd          = []
    for plugin in _initialized_plugins:
        if isinstance(plugin, FileTypePlugin):
            for ft in plugin.file_types:
                if plugin.on_import:
                    _on_import[ft].append(plugin)
                if plugin.on_postimport:
                    _on_postimport[ft].append(plugin)
                    _on_postadd.append(plugin)
                if plugin.on_preprocess:
                    _on_preprocess[ft].append(plugin)
                if plugin.on_postprocess:
                    _on_postprocess[ft].append(plugin)
 def plugins_for_ft(ft, occasion):
    op = {
        'import':_on_import, 'preprocess':_on_preprocess, 'postprocess':_on_postprocess, 'postimport':_on_postimport,
    }[occasion]
    for p in chain(op.get(ft, ()), op.get('*', ())):
        if not is_disabled(p):
            yield p
 def _run_filetype_plugins(path_to_file, ft=None, occasion='preprocess'):
    customization = config['plugin_customization']
    if ft is None:
        ft = os.path.splitext(path_to_file)[-1].lower().replace('.', '')
    nfp = path_to_file
    for plugin in plugins_for_ft(ft, occasion):
        plugin.site_customization = customization.get(plugin.name, '')
        oo, oe = sys.stdout, sys.stderr  # Some file type plugins out there override the output streams with buggy implementations
        with plugin:
            try:
                plugin.original_path_to_file = path_to_file
            except Exception:
                pass
            try:
                nfp = plugin.run(nfp) or nfp
            except:
                print('Running file type plugin %s failed with traceback:'%plugin.name, file=oe)
                traceback.print_exc(file=oe)
        sys.stdout, sys.stderr = oo, oe
    x = lambda j: os.path.normpath(os.path.normcase(j))
    if occasion == 'postprocess' and x(nfp) != x(path_to_file):
        shutil.copyfile(nfp, path_to_file)
        nfp = path_to_file
    return nfp
 run_plugins_on_import      = functools.partial(_run_filetype_plugins, occasion='import')
 run_plugins_on_preprocess  = functools.partial(_run_filetype_plugins, occasion='preprocess')
 run_plugins_on_postprocess = functools.partial(_run_filetype_plugins, occasion='postprocess')
 def run_plugins_on_postimport(db, book_id, fmt):
    customization = config['plugin_customization']
    fmt = fmt.lower()
    for plugin in plugins_for_ft(fmt, 'postimport'):
        plugin.site_customization = customization.get(plugin.name, '')
        with plugin:
            try:
                plugin.postimport(book_id, fmt, db)
            except Exception:
                print('Running file type plugin %s failed with traceback:'%
                       plugin.name)
                traceback.print_exc()
 def run_plugins_on_postadd(db, book_id, fmt_map):
    customization = config['plugin_customization']
    for plugin in _on_postadd:
        if is_disabled(plugin):
            continue
        plugin.site_customization = customization.get(plugin.name, '')
        with plugin:
            try:
                plugin.postadd(book_id, fmt_map, db)
            except Exception:
                print('Running file type plugin %s failed with traceback:'%
                       plugin.name)
                traceback.print_exc()
 # }}}
 # Plugin customization {{{
 def customize_plugin(plugin, custom):
    d = config['plugin_customization']
    d[plugin.name] = custom.strip()
    config['plugin_customization'] = d
 def plugin_customization(plugin):
    return config['plugin_customization'].get(plugin.name, '')
 # }}}
 # Input/Output profiles {{{
 def input_profiles():
    for plugin in _initialized_plugins:
        if isinstance(plugin, InputProfile):
            yield plugin
 def output_profiles():
    for plugin in _initialized_plugins:
        if isinstance(plugin, OutputProfile):
            yield plugin
 # }}}
 # Interface Actions # {{{
 def interface_actions():
    customization = config['plugin_customization']
    for plugin in _initialized_plugins:
        if isinstance(plugin, InterfaceAction):
            if not is_disabled(plugin):
                plugin.site_customization = customization.get(plugin.name, '')
                yield plugin
 # }}}
 # Preferences Plugins # {{{
 def preferences_plugins():
    customization = config['plugin_customization']
    for plugin in _initialized_plugins:
        if isinstance(plugin, PreferencesPlugin):
            if not is_disabled(plugin):
                plugin.site_customization = customization.get(plugin.name, '')
                yield plugin
 # }}}
 # Library Closed Plugins # {{{
 def available_library_closed_plugins():
    customization = config['plugin_customization']
    for plugin in _initialized_plugins:
        if isinstance(plugin, LibraryClosedPlugin):
            if not is_disabled(plugin):
                plugin.site_customization = customization.get(plugin.name, '')
                yield plugin
 def has_library_closed_plugins():
    for plugin in _initialized_plugins:
        if isinstance(plugin, LibraryClosedPlugin):
            if not is_disabled(plugin):
                return True
    return False
 # }}}
 # Store Plugins # {{{
 def store_plugins():
    customization = config['plugin_customization']
    for plugin in _initialized_plugins:
        if isinstance(plugin, Store):
            plugin.site_customization = customization.get(plugin.name, '')
            yield plugin
 def available_store_plugins():
    for plugin in store_plugins():
        if not is_disabled(plugin):
            yield plugin
 def stores():
    stores = set()
    for plugin in store_plugins():
        stores.add(plugin.name)
    return stores
 def available_stores():
    stores = set()
    for plugin in available_store_plugins():
        stores.add(plugin.name)
    return stores
 # }}}
 # Metadata read/write {{{
 _metadata_readers = {}
 _metadata_writers = {}
 def reread_metadata_plugins():
    global _metadata_readers
    global _metadata_writers
    _metadata_readers = defaultdict(list)
    _metadata_writers = defaultdict(list)
    for plugin in _initialized_plugins:
        if isinstance(plugin, MetadataReaderPlugin):
            for ft in plugin.file_types:
                _metadata_readers[ft].append(plugin)
        elif isinstance(plugin, MetadataWriterPlugin):
            for ft in plugin.file_types:
                _metadata_writers[ft].append(plugin)
    # Ensure custom metadata plugins are used in preference to builtin
    # ones for a given filetype
    def key(plugin):
        return (1 if plugin.plugin_path is None else 0), plugin.name
    for group in (_metadata_readers, _metadata_writers):
        for plugins in itervalues(group):
            if len(plugins) > 1:
                plugins.sort(key=key)
 def metadata_readers():
    ans = set()
    for plugins in _metadata_readers.values():
        for plugin in plugins:
            ans.add(plugin)
    return ans
 def metadata_writers():
    ans = set()
    for plugins in _metadata_writers.values():
        for plugin in plugins:
            ans.add(plugin)
    return ans
 class QuickMetadata(object):
    def __init__(self):
        self.quick = False
    def __enter__(self):
        self.quick = True
    def __exit__(self, *args):
        self.quick = False
 quick_metadata = QuickMetadata()
 class ApplyNullMetadata(object):
    def __init__(self):
        self.apply_null = False
    def __enter__(self):
        self.apply_null = True
    def __exit__(self, *args):
        self.apply_null = False
 apply_null_metadata = ApplyNullMetadata()
 class ForceIdentifiers(object):
    def __init__(self):
        self.force_identifiers = False
    def __enter__(self):
        self.force_identifiers = True
    def __exit__(self, *args):
        self.force_identifiers = False
 force_identifiers = ForceIdentifiers()
 def get_file_type_metadata(stream, ftype):
    mi = MetaInformation(None, None)
    ftype = ftype.lower().strip()
    if ftype in _metadata_readers:
        for plugin in _metadata_readers[ftype]:
            if not is_disabled(plugin):
                with plugin:
                    try:
                        plugin.quick = quick_metadata.quick
                        if hasattr(stream, 'seek'):
                            stream.seek(0)
                        mi = plugin.get_metadata(stream, ftype.lower().strip())
                        break
                    except:
                        traceback.print_exc()
                        continue
    return mi
 def set_file_type_metadata(stream, mi, ftype, report_error=None):
    ftype = ftype.lower().strip()
    if ftype in _metadata_writers:
        customization = config['plugin_customization']
        for plugin in _metadata_writers[ftype]:
            if not is_disabled(plugin):
                with plugin:
                    try:
                        plugin.apply_null = apply_null_metadata.apply_null
                        plugin.force_identifiers = force_identifiers.force_identifiers
                        plugin.site_customization = customization.get(plugin.name, '')
                        plugin.set_metadata(stream, mi, ftype.lower().strip())
                        break
                    except:
                        if report_error is None:
                            from calibre import prints
                            prints('Failed to set metadata for the', ftype.upper(), 'format of:', getattr(mi, 'title', ''), file=sys.stderr)
                            traceback.print_exc()
                        else:
                            report_error(mi, ftype, traceback.format_exc())
 def can_set_metadata(ftype):
    ftype = ftype.lower().strip()
    for plugin in _metadata_writers.get(ftype, ()):
        if not is_disabled(plugin):
            return True
    return False
 # }}}
 # Add/remove plugins {{{
 def add_plugin(path_to_zip_file):
    make_config_dir()
    plugin = load_plugin(path_to_zip_file)
    if plugin.name in builtin_names:
        raise NameConflict(
            'A builtin plugin with the name %r already exists' % plugin.name)
    plugin = initialize_plugin(plugin, path_to_zip_file)
    plugins = config['plugins']
    zfp = os.path.join(plugin_dir, plugin.name+'.zip')
    if os.path.exists(zfp):
        os.remove(zfp)
    shutil.copyfile(path_to_zip_file, zfp)
    plugins[plugin.name] = zfp
    config['plugins'] = plugins
    initialize_plugins()
    return plugin
 def remove_plugin(plugin_or_name):
    name = getattr(plugin_or_name, 'name', plugin_or_name)
    plugins = config['plugins']
    removed = False
    if name in plugins:
        removed = True
        try:
            zfp = os.path.join(plugin_dir, name+'.zip')
            if os.path.exists(zfp):
                os.remove(zfp)
            zfp = plugins[name]
            if os.path.exists(zfp):
                os.remove(zfp)
        except:
            pass
        plugins.pop(name)
    config['plugins'] = plugins
    initialize_plugins()
    return removed
 # }}}
 # Input/Output format plugins {{{
 def input_format_plugins():
    for plugin in _initialized_plugins:
        if isinstance(plugin, InputFormatPlugin):
            yield plugin
 def plugin_for_input_format(fmt):
    customization = config['plugin_customization']
    for plugin in input_format_plugins():
        if fmt.lower() in plugin.file_types:
            plugin.site_customization = customization.get(plugin.name, None)
            return plugin
 def all_input_formats():
    formats = set()
    for plugin in input_format_plugins():
        for format in plugin.file_types:
            formats.add(format)
    return formats
 def available_input_formats():
    formats = set()
    for plugin in input_format_plugins():
        if not is_disabled(plugin):
            for format in plugin.file_types:
                formats.add(format)
    formats.update(('zip', 'rar'))
    return formats
 def output_format_plugins():
    for plugin in _initialized_plugins:
        if isinstance(plugin, OutputFormatPlugin):
            yield plugin
 def plugin_for_output_format(fmt):
    customization = config['plugin_customization']
    for plugin in output_format_plugins():
        if fmt.lower() == plugin.file_type:
            plugin.site_customization = customization.get(plugin.name, None)
            return plugin
 def available_output_formats():
    formats = set()
    for plugin in output_format_plugins():
        if not is_disabled(plugin):
            formats.add(plugin.file_type)
    return formats
 # }}}
 # Catalog plugins {{{
 def catalog_plugins():
    for plugin in _initialized_plugins:
        if isinstance(plugin, CatalogPlugin):
            yield plugin
 def available_catalog_formats():
    formats = set()
    for plugin in catalog_plugins():
        if not is_disabled(plugin):
            for format in plugin.file_types:
                formats.add(format)
    return formats
 def plugin_for_catalog_format(fmt):
    for plugin in catalog_plugins():
        if fmt.lower() in plugin.file_types:
            return plugin
 # }}}
 # Device plugins {{{
 def device_plugins(include_disabled=False):
    for plugin in _initialized_plugins:
        if isinstance(plugin, DevicePlugin):
            if include_disabled or not is_disabled(plugin):
                if platform in plugin.supported_platforms:
                    if getattr(plugin, 'plugin_needs_delayed_initialization',
                            False):
                        plugin.do_delayed_plugin_initialization()
                    yield plugin
 def disabled_device_plugins():
    for plugin in _initialized_plugins:
        if isinstance(plugin, DevicePlugin):
            if is_disabled(plugin):
                if platform in plugin.supported_platforms:
                    yield plugin
 # }}}
 # Metadata sources2 {{{
 def metadata_plugins(capabilities):
    capabilities = frozenset(capabilities)
    for plugin in all_metadata_plugins():
        if plugin.capabilities.intersection(capabilities) and \
                not is_disabled(plugin):
            yield plugin
 def all_metadata_plugins():
    for plugin in _initialized_plugins:
        if isinstance(plugin, Source):
            yield plugin
 def patch_metadata_plugins(possibly_updated_plugins):
    patches = {}
    for i, plugin in enumerate(_initialized_plugins):
        if isinstance(plugin, Source) and plugin.name in builtin_names:
            pup = possibly_updated_plugins.get(plugin.name)
            if pup is not None:
                if pup.version > plugin.version and pup.minimum_calibre_version <= numeric_version:
                    patches[i] = pup(None)
                    # Metadata source plugins don't use initialize() but that
                    # might change in the future, so be safe.
                    patches[i].initialize()
    for i, pup in iteritems(patches):
        _initialized_plugins[i] = pup
 # }}}
 # Editor plugins {{{
 def all_edit_book_tool_plugins():
    for plugin in _initialized_plugins:
        if isinstance(plugin, EditBookToolPlugin):
            yield plugin
 # }}}
 # Initialize plugins {{{
 _initialized_plugins = []
 def initialize_plugin(plugin, path_to_zip_file):
    try:
        p = plugin(path_to_zip_file)
        p.initialize()
        return p
    except Exception:
        print('Failed to initialize plugin:', plugin.name, plugin.version)
        tb = traceback.format_exc()
        raise InvalidPlugin((_('Initialization of plugin %s failed with traceback:')
                            %plugin.name) + '\n'+tb)
 def has_external_plugins():
    'True if there are updateable (ZIP file based) plugins'
    return bool(config['plugins'])
 def initialize_plugins(perf=False):
    global _initialized_plugins
    _initialized_plugins = []
    conflicts = [name for name in config['plugins'] if name in
            builtin_names]
    for p in conflicts:
        remove_plugin(p)
    external_plugins = config['plugins'].copy()
    for name in BLACKLISTED_PLUGINS:
        external_plugins.pop(name, None)
    ostdout, ostderr = sys.stdout, sys.stderr
    if perf:
        import time
        times = defaultdict(lambda:0)
    for zfp in list(external_plugins) + builtin_plugins:
        try:
            if not isinstance(zfp, type):
                # We have a plugin name
                pname = zfp
                zfp = os.path.join(plugin_dir, zfp+'.zip')
                if not os.path.exists(zfp):
                    zfp = external_plugins[pname]
            try:
                plugin = load_plugin(zfp) if not isinstance(zfp, type) else zfp
            except PluginNotFound:
                continue
            if perf:
                st = time.time()
            plugin = initialize_plugin(plugin, None if isinstance(zfp, type) else zfp)
            if perf:
                times[plugin.name] = time.time() - st
            _initialized_plugins.append(plugin)
        except:
            print('Failed to initialize plugin:', repr(zfp))
            if DEBUG:
                traceback.print_exc()
    # Prevent a custom plugin from overriding stdout/stderr as this breaks
    # ipython
    sys.stdout, sys.stderr = ostdout, ostderr
    if perf:
        for x in sorted(times, key=lambda x: times[x]):
            print('%50s: %.3f'%(x, times[x]))
    _initialized_plugins.sort(key=lambda x: x.priority, reverse=True)
    reread_filetype_plugins()
    reread_metadata_plugins()
 initialize_plugins()
 def initialized_plugins():
    for plugin in _initialized_plugins:
        yield plugin
 # }}}
 # CLI {{{
 def build_plugin(path):
    from calibre import prints
    from calibre.ptempfile import PersistentTemporaryFile
    from calibre.utils.zipfile import ZipFile, ZIP_STORED
    path = unicode_type(path)
    names = frozenset(os.listdir(path))
    if '__init__.py' not in names:
        prints(path, 'is not a valid plugin')
        raise SystemExit(1)
    t = PersistentTemporaryFile(u'.zip')
    with ZipFile(t, 'w', ZIP_STORED) as zf:
        zf.add_dir(path, simple_filter=lambda x:x in {'.git', '.bzr', '.svn', '.hg'})
    t.close()
    plugin = add_plugin(t.name)
    os.remove(t.name)
    prints('Plugin updated:', plugin.name, plugin.version)
 def option_parser():
    parser = OptionParser(usage=_('''\
    %prog options
    Customize calibre by loading external plugins.
    '''))
    parser.add_option('-a', '--add-plugin', default=None,
                      help=_('Add a plugin by specifying the path to the ZIP file containing it.'))
    parser.add_option('-b', '--build-plugin', default=None,
            help=_('For plugin developers: Path to the directory where you are'
                ' developing the plugin. This command will automatically zip '
                'up the plugin and update it in calibre.'))
    parser.add_option('-r', '--remove-plugin', default=None,
                      help=_('Remove a custom plugin by name. Has no effect on builtin plugins'))
    parser.add_option('--customize-plugin', default=None,
                      help=_('Customize plugin. Specify name of plugin and customization string separated by a comma.'))
    parser.add_option('-l', '--list-plugins', default=False, action='store_true',
                      help=_('List all installed plugins'))
    parser.add_option('--enable-plugin', default=None,
                      help=_('Enable the named plugin'))
    parser.add_option('--disable-plugin', default=None,
                      help=_('Disable the named plugin'))
    return parser
 def main(args=sys.argv):
    parser = option_parser()
    if len(args) < 2:
        parser.print_help()
        return 1
    opts, args = parser.parse_args(args)
    if opts.add_plugin is not None:
        plugin = add_plugin(opts.add_plugin)
        print('Plugin added:', plugin.name, plugin.version)
    if opts.build_plugin is not None:
        build_plugin(opts.build_plugin)
    if opts.remove_plugin is not None:
        if remove_plugin(opts.remove_plugin):
            print('Plugin removed')
        else:
            print('No custom plugin named', opts.remove_plugin)
    if opts.customize_plugin is not None:
        # Split on the first comma only, so the customization string may itself contain commas
        name, custom = opts.customize_plugin.split(',', 1)
        plugin = find_plugin(name.strip())
        if plugin is None:
            print('No plugin with the name %s exists'%name)
            return 1
        customize_plugin(plugin, custom)
    if opts.enable_plugin is not None:
        enable_plugin(opts.enable_plugin.strip())
    if opts.disable_plugin is not None:
        disable_plugin(opts.disable_plugin.strip())
    if opts.list_plugins:
        type_len = name_len = 0
        for plugin in initialized_plugins():
            type_len, name_len = max(type_len, len(plugin.type)), max(name_len, len(plugin.name))
        fmt = '%-{}s%-{}s%-15s%-15s%s'.format(type_len+1, name_len+1)
        print(fmt%tuple(('Type|Name|Version|Disabled|Site Customization'.split('|'))))
        print()
        for plugin in initialized_plugins():
            print(fmt%(
                                plugin.type, plugin.name,
                                plugin.version, is_disabled(plugin),
                                plugin_customization(plugin)
                                ))
            print('\t', plugin.description)
            if plugin.is_customizable():
                try:
                    print('\t', plugin.customization_help())
                except NotImplementedError:
                    pass
            print()
    return 0
 if __name__ == '__main__':
    sys.exit(main())
 # }}}
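Note on the enable/disable bookkeeping in ui.py above: an explicit entry in the enabled set overrides both the user's disabled set and the built-in default-disabled list. The sketch below isolates that precedence rule. It assumes plain Python sets in place of the `config` proxy, and the `Plugin` tuple, the plugin names, and the extra function parameters are hypothetical, used only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a plugin object; only `name` is consulted.
Plugin = namedtuple('Plugin', 'name')

# Assumed subset of the default-disabled names, for illustration only.
DEFAULT_DISABLED = frozenset({'Overdrive', 'Google Images'})


def is_disabled(plugin, enabled, disabled, default_disabled=DEFAULT_DISABLED):
    """Mirror the precedence used above: enabled wins over everything."""
    if plugin.name in enabled:
        return False
    return plugin.name in disabled or plugin.name in default_disabled


enabled = {'Overdrive'}     # user explicitly re-enabled a default-disabled plugin
disabled = {'Some Store'}   # user explicitly disabled an ordinary plugin

results = {
    name: is_disabled(Plugin(name), enabled, disabled)
    for name in ('Overdrive', 'Google Images', 'Some Store', 'Other Plugin')
}
print(results)
```

Clearing a name from both sets, as `restore_plugin_state_to_default` does, leaves the default-disabled list as the only deciding factor.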
--- a/ebook_converter/customize/zipplugin.py
+++ b/ebook_converter/customize/zipplugin.py
@@ -0,0 +1,320 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2011, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import os, zipfile, posixpath, importlib, threading, re, imp, sys
 from collections import OrderedDict
 from functools import partial
 from calibre import as_unicode
 from calibre.constants import ispy3
 from calibre.customize import (Plugin, numeric_version, platform,
        InvalidPlugin, PluginNotFound)
 from polyglot.builtins import (itervalues, map, string_or_bytes,
        unicode_type, reload)
 # PEP 302 based plugin loading mechanism, works around the bug in zipimport in
 # python 2.x that prevents importing from zip files in locations whose paths
 # have non-ASCII characters
 def get_resources(zfp, name_or_list_of_names):
    '''
    Load resources from the plugin zip file
    :param name_or_list_of_names: List of paths to resources in the zip file using / as
                separator, or a single path
    :return: A dictionary of the form ``{name : file_contents}``. Any names
                that were not found in the zip file will not be present in the
                dictionary. If a single path is passed in the return value will
                be just the bytes of the resource or None if it wasn't found.
    '''
    names = name_or_list_of_names
    if isinstance(names, string_or_bytes):
        names = [names]
    ans = {}
    with zipfile.ZipFile(zfp) as zf:
        for name in names:
            try:
                ans[name] = zf.read(name)
            except:
                import traceback
                traceback.print_exc()
    if len(names) == 1:
        ans = ans.pop(names[0], None)
    return ans
 def get_icons(zfp, name_or_list_of_names):
    '''
    Load icons from the plugin zip file
    :param name_or_list_of_names: List of paths to resources in the zip file using / as
                separator, or a single path
    :return: A dictionary of the form ``{name : QIcon}``. Any names
                that were not found in the zip file will be null QIcons.
                If a single path is passed in the return value will
                be a QIcon.
    '''
    from PyQt5.Qt import QIcon, QPixmap
    names = name_or_list_of_names
    ans = get_resources(zfp, names)
    if isinstance(names, string_or_bytes):
        names = [names]
    if ans is None:
        ans = {}
    if isinstance(ans, string_or_bytes):
        ans = dict([(names[0], ans)])
    ians = {}
    for name in names:
        p = QPixmap()
        raw = ans.get(name, None)
        if raw:
            p.loadFromData(raw)
        ians[name] = QIcon(p)
    if len(names) == 1:
        ians = ians.pop(names[0])
    return ians
 _translations_cache = {}
 def load_translations(namespace, zfp):
    null = object()
    trans = _translations_cache.get(zfp, null)
    if trans is None:
        return
    if trans is null:
        from calibre.utils.localization import get_lang
        lang = get_lang()
        if not lang or lang == 'en':  # performance optimization
            _translations_cache[zfp] = None
            return
        with zipfile.ZipFile(zfp) as zf:
            try:
                mo = zf.read('translations/%s.mo' % lang)
            except KeyError:
                mo = None  # No translations for this language present
        if mo is None:
            _translations_cache[zfp] = None
            return
        from gettext import GNUTranslations
        from io import BytesIO
        trans = _translations_cache[zfp] = GNUTranslations(BytesIO(mo))
    namespace['_'] = getattr(trans, 'gettext' if ispy3 else 'ugettext')
    namespace['ngettext'] = getattr(trans, 'ngettext' if ispy3 else 'ungettext')
 class PluginLoader(object):
    def __init__(self):
        self.loaded_plugins = {}
        self._lock = threading.RLock()
        self._identifier_pat = re.compile(r'[a-zA-Z][_0-9a-zA-Z]*')
    def _get_actual_fullname(self, fullname):
        parts = fullname.split('.')
        if parts[0] == 'calibre_plugins':
            if len(parts) == 1:
                return parts[0], None
            plugin_name = parts[1]
            with self._lock:
                names = self.loaded_plugins.get(plugin_name, None)
                if names is None:
                    raise ImportError('No plugin named %r loaded'%plugin_name)
                names = names[1]
                fullname = '.'.join(parts[2:])
                if not fullname:
                    fullname = '__init__'
                if fullname in names:
                    return fullname, plugin_name
                if fullname+'.__init__' in names:
                    return fullname+'.__init__', plugin_name
        return None, None
    def find_module(self, fullname, path=None):
        fullname, plugin_name = self._get_actual_fullname(fullname)
        if fullname is None and plugin_name is None:
            return None
        return self
    def load_module(self, fullname):
        import_name, plugin_name = self._get_actual_fullname(fullname)
        if import_name is None and plugin_name is None:
            raise ImportError('No plugin named %r is loaded'%fullname)
        mod = sys.modules.setdefault(fullname, imp.new_module(fullname))
        mod.__file__ = "<calibre Plugin Loader>"
        mod.__loader__ = self
        if import_name.endswith('.__init__') or import_name in ('__init__',
                'calibre_plugins'):
            # We have a package
            mod.__path__ = []
        if plugin_name is not None:
            # We have some actual code to load
            with self._lock:
                zfp, names = self.loaded_plugins.get(plugin_name, (None, None))
            if names is None:
                raise ImportError('No plugin named %r loaded'%plugin_name)
            zinfo = names.get(import_name, None)
            if zinfo is None:
                raise ImportError('Plugin %r has no module named %r' %
                        (plugin_name, import_name))
            with zipfile.ZipFile(zfp) as zf:
                try:
                    code = zf.read(zinfo)
                except Exception:
                    # Maybe the zip file changed from under us
                    code = zf.read(zinfo.filename)
            compiled = compile(code, 'calibre_plugins.%s.%s'%(plugin_name,
                import_name), 'exec', dont_inherit=True)
            mod.__dict__['get_resources'] = partial(get_resources, zfp)
            mod.__dict__['get_icons'] = partial(get_icons, zfp)
            mod.__dict__['load_translations'] = partial(load_translations, mod.__dict__, zfp)
            exec(compiled, mod.__dict__)
        return mod
    def load(self, path_to_zip_file):
        if not os.access(path_to_zip_file, os.R_OK):
            raise PluginNotFound('Cannot access %r'%path_to_zip_file)
        with zipfile.ZipFile(path_to_zip_file) as zf:
            plugin_name = self._locate_code(zf, path_to_zip_file)
        try:
            ans = None
            plugin_module = 'calibre_plugins.%s'%plugin_name
            m = sys.modules.get(plugin_module, None)
            if m is not None:
                reload(m)
            else:
                m = importlib.import_module(plugin_module)
            plugin_classes = []
            for obj in itervalues(m.__dict__):
                if isinstance(obj, type) and issubclass(obj, Plugin) and \
                        obj.name != 'Trivial Plugin':
                    plugin_classes.append(obj)
            if not plugin_classes:
                raise InvalidPlugin('No plugin class found in %s:%s'%(
                    as_unicode(path_to_zip_file), plugin_name))
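            if len(plugin_classes) > 1:
                # Prefer the class defined in the shallowest module, that is,
                # the one whose __module__ contains the fewest dots
                plugin_classes.sort(key=lambda c:(getattr(c, '__module__', None) or '').count('.'))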
            ans = plugin_classes[0]
            if ans.minimum_calibre_version > numeric_version:
                raise InvalidPlugin(
                    'The plugin at %s needs a version of calibre >= %s' %
                    (as_unicode(path_to_zip_file), '.'.join(map(unicode_type,
                        ans.minimum_calibre_version))))
            if platform not in ans.supported_platforms:
                raise InvalidPlugin(
                    'The plugin at %s cannot be used on %s' %
                    (as_unicode(path_to_zip_file), platform))
            return ans
        except:
            with self._lock:
                del self.loaded_plugins[plugin_name]
            raise
    def _locate_code(self, zf, path_to_zip_file):
        names = [x if isinstance(x, unicode_type) else x.decode('utf-8') for x in
                zf.namelist()]
        names = [x[1:] if x.startswith('/') else x for x in names]
        plugin_name = None
        for name in names:
            name, ext = posixpath.splitext(name)
            if name.startswith('plugin-import-name-') and ext == '.txt':
                plugin_name = name.rpartition('-')[-1]
        if plugin_name is None:
            c = 0
            while True:
                c += 1
                plugin_name = 'dummy%d'%c
                if plugin_name not in self.loaded_plugins:
                    break
        else:
            if self._identifier_pat.match(plugin_name) is None:
                raise InvalidPlugin((
                    'The plugin at %r uses an invalid import name: %r' %
                    (path_to_zip_file, plugin_name)))
        pynames = [x for x in names if x.endswith('.py')]
        candidates = [posixpath.dirname(x) for x in pynames if
                x.endswith('/__init__.py')]
        candidates.sort(key=lambda x: x.count('/'))
        valid_packages = set()
        for candidate in candidates:
            parts = candidate.split('/')
            parent = '.'.join(parts[:-1])
            if parent and parent not in valid_packages:
                continue
            valid_packages.add('.'.join(parts))
        names = OrderedDict()
        for candidate in pynames:
            parts = posixpath.splitext(candidate)[0].split('/')
            package = '.'.join(parts[:-1])
            if package and package not in valid_packages:
                continue
            name = '.'.join(parts)
            names[name] = zf.getinfo(candidate)
        # Legacy plugins
        if '__init__' not in names:
            for name in tuple(names):
                if '.' not in name and name.endswith('plugin'):
                    names['__init__'] = names[name]
                    break
        if '__init__' not in names:
            raise InvalidPlugin(('The plugin in %r is invalid. It does not '
                    'contain a top-level __init__.py file')
                    % path_to_zip_file)
        with self._lock:
            self.loaded_plugins[plugin_name] = (path_to_zip_file, names)
        return plugin_name
 loader = PluginLoader()
 sys.meta_path.insert(0, loader)
 if __name__ == '__main__':
    from tempfile import NamedTemporaryFile
    from calibre.customize.ui import add_plugin
    from calibre import CurrentDir
    path = sys.argv[-1]
    with NamedTemporaryFile(suffix='.zip') as f:
        with zipfile.ZipFile(f, 'w') as zf:
            with CurrentDir(path):
                for x in os.listdir('.'):
                    if x[0] != '.':
                        print('Adding', x)
                    zf.write(x)
                    if os.path.isdir(x):
                        for y in os.listdir(x):
                            zf.write(os.path.join(x, y))
        add_plugin(f.name)
        print('Added plugin from', sys.argv[-1])
--- a/ebook_converter/devices/init.py
+++ b/ebook_converter/devices/init.py
@@ -0,0 +1,216 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 '''
 Device drivers.
 '''
 import sys, time, pprint
 from functools import partial
 from polyglot.builtins import zip, unicode_type
 DAY_MAP   = dict(Sun=0, Mon=1, Tue=2, Wed=3, Thu=4, Fri=5, Sat=6)
 MONTH_MAP = dict(Jan=1, Feb=2, Mar=3, Apr=4, May=5, Jun=6, Jul=7, Aug=8, Sep=9, Oct=10, Nov=11, Dec=12)
 INVERSE_DAY_MAP = dict(zip(DAY_MAP.values(), DAY_MAP.keys()))
 INVERSE_MONTH_MAP = dict(zip(MONTH_MAP.values(), MONTH_MAP.keys()))
 def strptime(src):
    src = src.strip()
    src = src.split()
    src[0] = unicode_type(DAY_MAP[src[0][:-1]])+','
    src[2] = unicode_type(MONTH_MAP[src[2]])
    return time.strptime(' '.join(src), '%w, %d %m %Y %H:%M:%S %Z')
 def strftime(epoch, zone=time.gmtime):
    src = time.strftime("%w, %d %m %Y %H:%M:%S GMT", zone(epoch)).split()
    src[0] = INVERSE_DAY_MAP[int(src[0][:-1])]+','
    src[2] = INVERSE_MONTH_MAP[int(src[2])]
    return ' '.join(src)
 def get_connected_device():
    from calibre.customize.ui import device_plugins
    from calibre.devices.scanner import DeviceScanner
    dev = None
    scanner = DeviceScanner()
    scanner.scan()
    connected_devices = []
    for d in device_plugins():
        ok, det = scanner.is_device_connected(d)
        if ok:
            dev = d
            dev.reset(log_packets=False, detected_device=det)
            connected_devices.append((det, dev))
    if dev is None:
        print('Unable to find a connected ebook reader.', file=sys.stderr)
        return
    for det, d in connected_devices:
        try:
            d.open(det, None)
        except Exception:
            continue
        else:
            dev = d
            break
    return dev
 def debug(ioreg_to_tmp=False, buf=None, plugins=None,
        disabled_plugins=None):
    '''
    If plugins is None, this function calls startup and shutdown on the
    device plugins. If you are using it in a context where startup may
    already have been called (for example in the main GUI), pass in the list of
    device plugins as the plugins parameter.
    '''
    import textwrap
    from calibre.customize.ui import device_plugins, disabled_device_plugins
    from calibre.debug import print_basic_debug_info
    from calibre.devices.scanner import DeviceScanner
    from calibre.constants import iswindows, isosx
    from calibre import prints
    from polyglot.io import PolyglotBytesIO
    oldo, olde = sys.stdout, sys.stderr
    if buf is None:
        buf = PolyglotBytesIO()
    sys.stdout = sys.stderr = buf
    out = partial(prints, file=buf)
    devplugins = device_plugins() if plugins is None else plugins
    devplugins = sorted(devplugins, key=lambda x: x.__class__.__name__)
    if plugins is None:
        for d in devplugins:
            try:
                d.startup()
            except:
                out('Startup failed for device plugin: %s'%d)
    if disabled_plugins is None:
        disabled_plugins = list(disabled_device_plugins())
    try:
        print_basic_debug_info(out=buf)
        s = DeviceScanner()
        s.scan()
        devices = s.devices
        if not iswindows:
            devices = [list(x) for x in devices]
            for d in devices:
                for i in range(3):
                    d[i] = hex(d[i])
        out('USB devices on system:')
        out(pprint.pformat(devices))
        ioreg = None
        if isosx:
            from calibre.devices.usbms.device import Device
            mount = '\n'.join(repr(x) for x in Device.osx_run_mount().splitlines())
            drives = pprint.pformat(Device.osx_get_usb_drives())
            ioreg = 'Output from mount:\n'+mount+'\n\n'
            ioreg += 'Output from osx_get_usb_drives:\n'+drives+'\n\n'
            ioreg += Device.run_ioreg()
        connected_devices = []
        if disabled_plugins:
            out('\nDisabled plugins:', textwrap.fill(' '.join([x.__class__.__name__ for x in
                disabled_plugins])))
            out(' ')
        else:
            out('\nNo disabled plugins')
        found_dev = False
        for dev in devplugins:
            if not dev.MANAGES_DEVICE_PRESENCE:
                continue
            out('Looking for devices of type:', dev.__class__.__name__)
            if dev.debug_managed_device_detection(s.devices, buf):
                found_dev = True
                break
            out(' ')
        if not found_dev:
            out('Looking for devices...')
            for dev in devplugins:
                if dev.MANAGES_DEVICE_PRESENCE:
                    continue
                connected, det = s.is_device_connected(dev, debug=True)
                if connected:
                    out('\t\tDetected possible device', dev.__class__.__name__)
                    connected_devices.append((dev, det))
            out(' ')
            errors = {}
            success = False
            out('Devices possibly connected:', end=' ')
            for dev, det in connected_devices:
                out(dev.name, end=', ')
            if not connected_devices:
                out('None', end='')
            out(' ')
            for dev, det in connected_devices:
                out('Trying to open', dev.name, '...', end=' ')
                dev.do_device_debug = True
                try:
                    dev.reset(detected_device=det)
                    dev.open(det, None)
                    out('OK')
                except:
                    import traceback
                    errors[dev] = traceback.format_exc()
                    out('failed')
                    continue
                dev.do_device_debug = False
                success = True
                if hasattr(dev, '_main_prefix'):
                    out('Main memory:', repr(dev._main_prefix))
                out('Total space:', dev.total_space())
                break
            if not success and errors:
                out('Opening of the following devices failed')
                for dev,msg in errors.items():
                    out(dev)
                    out(msg)
                    out(' ')
            if ioreg is not None:
                ioreg = 'IOREG Output\n'+ioreg
                out(' ')
                if ioreg_to_tmp:
                    # Use a context manager so the file is flushed and closed
                    with lopen('/tmp/ioreg.txt', 'wb') as f:
                        f.write(ioreg)
                    out("Don't forget to send the contents of /tmp/ioreg.txt")
                    out('You can open it with the command: open /tmp/ioreg.txt')
                else:
                    out(ioreg)
        if hasattr(buf, 'getvalue'):
            return buf.getvalue().decode('utf-8', 'replace')
    finally:
        sys.stdout = oldo
        sys.stderr = olde
        if plugins is None:
            for d in devplugins:
                try:
                    d.shutdown()
                except:
                    pass
 def device_info(ioreg_to_tmp=False, buf=None):
    from calibre.devices.scanner import DeviceScanner
    res = {}
    res['device_set'] = device_set = set()
    res['device_details'] = device_details = {}
    s = DeviceScanner()
    s.scan()
    devices = s.devices
    devices = [tuple(x) for x in devices]
    for dev in devices:
        device_set.add(dev)
        device_details[dev] = dev[0:3]
    return res
--- a/ebook_converter/devices/interface.py
+++ b/ebook_converter/devices/interface.py
@@ -0,0 +1,787 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 import os
 from collections import namedtuple
 from calibre import prints
 from calibre.constants import iswindows
 from calibre.customize import Plugin
 class DevicePlugin(Plugin):
    """
    Defines the interface that should be implemented by backends that
    communicate with an e-book reader.
    """
    type = _('Device interface')
    #: Ordered list of supported formats
    FORMATS     = ["lrf", "rtf", "pdf", "txt"]
    # If True, the config dialog will not show the formats box
    HIDE_FORMATS_CONFIG_BOX = False
    #: VENDOR_ID can be an integer, a list of integers or a dictionary.
    #: If it is a dictionary, it must be a dictionary of dictionaries,
    #: of the form::
    #:
    #:   {
    #:    integer_vendor_id : { product_id : [list of BCDs], ... },
    #:    ...
    #:   }
    #:
    VENDOR_ID   = 0x0000
    #: An integer or a list of integers
    PRODUCT_ID  = 0x0000
    #: BCD can be either None to not distinguish between devices based on BCD, or
    #: it can be a list of the BCD numbers of all devices supported by this driver.
    BCD         = None
    #: Height for thumbnails on the device
    THUMBNAIL_HEIGHT = 68
    #: Compression quality for thumbnails. Set this closer to 100 to have better
    #: quality thumbnails with fewer compression artifacts. Of course, the
    #: thumbnails get larger as well.
    THUMBNAIL_COMPRESSION_QUALITY = 75
    #: Set this to True if the device supports updating cover thumbnails during
    #: sync_booklists. Setting it to true will ask device.py to refresh the
    #: cover thumbnails during book matching
    WANTS_UPDATED_THUMBNAILS = False
    #: The metadata fields on books that can be set via the GUI.
    CAN_SET_METADATA = ['title', 'authors', 'collections']
    #: Whether the device can handle device_db metadata plugboards
    CAN_DO_DEVICE_DB_PLUGBOARD = False
    # Set this to None if the books on the device are files that the GUI can
    # access in order to add the books from the device to the library
    BACKLOADING_ERROR_MESSAGE = _('Cannot get files from this device')
    #: Path separator for paths to books on device
    path_sep = os.sep
    #: Icon for this device
    icon = I('reader.png')
    # Encapsulates an annotation fetched from the device
    UserAnnotation = namedtuple('Annotation','type, value')
    #: GUI displays this as a message if not None. Useful if opening can take a
    #: long time
    OPEN_FEEDBACK_MESSAGE = None
    #: Set of extensions that are "virtual books" on the device
    #: and therefore cannot be viewed/saved/added to library.
    #: For example: ``frozenset(['kobo'])``
    VIRTUAL_BOOK_EXTENSIONS = frozenset()
    #: Message to display to user for virtual book extensions.
    VIRTUAL_BOOK_EXTENSION_MESSAGE = None
    #: Whether to nuke comments in the copy of the book sent to the device. If
    #: not None, this should be a short string that the comments will be
    #: replaced by.
    NUKE_COMMENTS = None
    #: If True, indicates that this driver completely manages device detection,
    #: ejecting and so forth. If you set this to True, you *must* implement the
    #: detect_managed_devices and debug_managed_device_detection methods.
    #: A driver with this set to true is responsible for detection of devices,
    #: managing a blacklist of devices, a list of ejected devices and so forth.
    #: calibre will periodically call the detect_managed_devices() method and
    #: if it returns a detected device, calibre will call open(). open() will
    #: be called every time a device is returned even if previous calls to open()
    #: failed, therefore the driver must maintain its own blacklist of failed
    #: devices. Similarly, when ejecting, calibre will call eject() and then
    #: assuming the next call to detect_managed_devices() returns None, it will
    #: call post_yank_cleanup().
    MANAGES_DEVICE_PRESENCE = False
    #: If set to True, calibre will call the :meth:`get_driveinfo()` method
    #: after the books lists have been loaded to get the driveinfo.
    SLOW_DRIVEINFO = False
    #: If set to True, calibre will ask the user if they want to manage the
    #: device with calibre, the first time it is detected. If you set this to
    #: True you must implement :meth:`get_device_uid()` and
    #: :meth:`ignore_connected_device()` and
    #: :meth:`get_user_blacklisted_devices` and
    #: :meth:`set_user_blacklisted_devices`
    ASK_TO_ALLOW_CONNECT = False
    #: Set this to a dictionary of the form {'title':title, 'msg':msg, 'det_msg':detailed_msg} to have calibre pop up
    #: a message to the user after some callbacks are run (currently only upload_books).
    #: Be careful to not spam the user with too many messages. This variable is checked after *every* callback,
    #: so only set it when you really need to.
    user_feedback_after_callback = None
    @classmethod
    def get_gui_name(cls):
        if hasattr(cls, 'gui_name'):
            return cls.gui_name
        if hasattr(cls, '__name__'):
            return cls.__name__
        return cls.name
    # Device detection {{{
    def test_bcd(self, bcdDevice, bcd):
        if bcd is None or len(bcd) == 0:
            return True
        for c in bcd:
            if c == bcdDevice:
                return True
        return False
    def is_usb_connected(self, devices_on_system, debug=False, only_presence=False):
        '''
        Return ``(True, device_info)`` if a device handled by this plugin is
        currently connected, otherwise ``(False, None)``.
        :param devices_on_system: List of devices currently connected
        '''
        vendors_on_system = {x[0] for x in devices_on_system}
        vendors = set(self.VENDOR_ID) if hasattr(self.VENDOR_ID, '__len__') else {self.VENDOR_ID}
        if hasattr(self.VENDOR_ID, 'keys'):
            products = []
            for ven in self.VENDOR_ID:
                products.extend(self.VENDOR_ID[ven].keys())
        else:
            products = self.PRODUCT_ID if hasattr(self.PRODUCT_ID, '__len__') else [self.PRODUCT_ID]
        ch = self.can_handle_windows if iswindows else self.can_handle
        for vid in vendors_on_system.intersection(vendors):
            for dev in devices_on_system:
                cvid, pid, bcd = dev[:3]
                if cvid == vid:
                    if pid in products:
                        if hasattr(self.VENDOR_ID, 'keys'):
                            try:
                                cbcd = self.VENDOR_ID[vid][pid]
                            except KeyError:
                                # Vendor vid does not have product pid, pid
                                # exists for some other vendor in this
                                # device
                                continue
                        else:
                            cbcd = self.BCD
                        if self.test_bcd(bcd, cbcd):
                            if debug:
                                prints(dev)
                            if ch(dev, debug=debug):
                                return True, dev
        return False, None
    def detect_managed_devices(self, devices_on_system, force_refresh=False):
        '''
        Called only if MANAGES_DEVICE_PRESENCE is True.
        Scan for devices that this driver can handle. Should return a device
        object if a device is found. This object will be passed to the open()
        method as the connected_device. If no device is found, return None. The
        returned object can be anything, calibre does not use it, it is only
        passed to open().
        This method is called periodically by the GUI, so make sure it is not
        too resource intensive. Use a cache to avoid repeatedly scanning the
        system.
        :param devices_on_system: Set of USB devices found on the system.
        :param force_refresh: If True and the driver uses a cache to prevent
                              repeated scanning, the cache must be flushed.
        '''
        raise NotImplementedError()
    def debug_managed_device_detection(self, devices_on_system, output):
        '''
        Called only if MANAGES_DEVICE_PRESENCE is True.
        Should write information about the devices detected on the system to
        output, which is a file like object.
        Should return True if a device was detected and successfully opened,
        otherwise False.
        '''
        raise NotImplementedError()
    # }}}
    def reset(self, key='-1', log_packets=False, report_progress=None,
            detected_device=None):
        """
        :param key: The key to unlock the device
        :param log_packets: If true the packet stream to/from the device is logged
        :param report_progress: Function that is called with a % progress
                                (number between 0 and 100) for various tasks
                                If it is called with -1 that means that the
                                task does not have any progress information
        :param detected_device: Device information from the device scanner
        """
        raise NotImplementedError()
    def can_handle_windows(self, usbdevice, debug=False):
        '''
        Optional method to perform further checks on a device to see if this driver
        is capable of handling it. If it is not it should return False. This method
        is only called after the vendor, product ids and the bcd have matched, so
        it can do some relatively time intensive checks. The default implementation
        returns True. This method is called only on Windows. See also
        :meth:`can_handle`.
        Note that for devices based on USBMS this method by default delegates
        to :meth:`can_handle`.  So you only need to override :meth:`can_handle`
        in your subclass of USBMS.
        :param usbdevice: A usbdevice as returned by :func:`calibre.devices.winusb.scan_usb_devices`
        '''
        return True
    def can_handle(self, device_info, debug=False):
        '''
        Unix version of :meth:`can_handle_windows`.
        :param device_info: Is a tuple of (vid, pid, bcd, manufacturer, product,
                            serial number)
        '''
        return True
    can_handle.is_base_class_implementation = True
    def open(self, connected_device, library_uuid):
        '''
        Perform any device specific initialization. Called after the device is
        detected but before any other functions that communicate with the device.
        For example: For devices that present themselves as USB Mass storage
        devices, this method would be responsible for mounting the device or
        if the device has been automounted, for finding out where it has been
        mounted. The method :meth:`calibre.devices.usbms.device.Device.open` has
        an implementation of this function that should serve as a good example
        for USB Mass storage devices.
        This method can raise an OpenFeedback exception to display a message to
        the user.
        :param connected_device: The device that we are trying to open. It is
            a tuple of (vendor id, product id, bcd, manufacturer name, product
            name, device serial number). However, some devices have no serial
            number and on windows only the first three fields are present, the
            rest are None.
        :param library_uuid: The UUID of the current calibre library. Can be
            None if there is no library (for example when used from the command
            line).
        '''
        raise NotImplementedError()
    def eject(self):
        '''
        Un-mount / eject the device from the OS. This does not check if there
        are pending GUI jobs that need to communicate with the device.
        NOTE: This method may not be called on the same thread as the rest
        of the device methods.
        '''
        raise NotImplementedError()
    def post_yank_cleanup(self):
        '''
        Called if the user yanks the device without ejecting it first.
        '''
        raise NotImplementedError()
    def set_progress_reporter(self, report_progress):
        '''
        Set a function to report progress information.
        :param report_progress: Function that is called with a % progress
                                (number between 0 and 100) for various tasks
                                If it is called with -1 that means that the
                                task does not have any progress information
        '''
        raise NotImplementedError()
    def get_device_information(self, end_session=True):
        """
        Ask device for device information. See L{DeviceInfoQuery}.
        :return: (device name, device version, software version on device, mime type)
                 The tuple can optionally have a fifth element, which is a
                 drive information dictionary. See usbms.driver for an example.
        """
        raise NotImplementedError()
    def get_driveinfo(self):
        '''
        Return the driveinfo dictionary. Usually called from
        get_device_information(), but if loading the driveinfo is slow for this
        driver, then it should set SLOW_DRIVEINFO. In this case, this method
        will be called by calibre after the book lists have been loaded. Note
        that it is not called on the device thread, so the driver should cache
        the drive info in the books() method and this function should return
        the cached data.
        '''
        return {}
    def card_prefix(self, end_session=True):
        '''
        Return a 2 element list of the prefixes to paths on the cards.
        If no card is present, None is set for that card's prefix.
        For example:
        ('/place', '/place2')
        (None, 'place2')
        ('place', None)
        (None, None)
        '''
        raise NotImplementedError()
    def total_space(self, end_session=True):
        """
        Get total space available on the mountpoints:
            1. Main memory
            2. Memory Card A
            3. Memory Card B
        :return: A 3 element list with total space in bytes of (1, 2, 3). If a
                 particular device doesn't have any of these locations it should return 0.
        """
        raise NotImplementedError()
    def free_space(self, end_session=True):
        """
        Get free space available on the mountpoints:
          1. Main memory
          2. Card A
          3. Card B
        :return: A 3 element list with free space in bytes of (1, 2, 3). If a
                 particular device doesn't have any of these locations it should return -1.
        """
        raise NotImplementedError()
    def books(self, oncard=None, end_session=True):
        """
        Return a list of e-books on the device.
        :param oncard:  If 'carda' or 'cardb' return a list of e-books on the
                        specific storage card, otherwise return list of e-books
                        in main memory of device. If a card is specified and no
                        books are on the card return empty list.
        :return: A BookList.
        """
        raise NotImplementedError()
    def upload_books(self, files, names, on_card=None, end_session=True,
                     metadata=None):
        '''
        Upload a list of books to the device. If a file already
        exists on the device, it should be replaced.
        This method should raise a :class:`FreeSpaceError` if there is not enough
        free space on the device. The text of the FreeSpaceError must contain the
        word "card" if ``on_card`` is not None; otherwise it must contain the word "memory".
        :param files: A list of paths
        :param names: A list of file names that the books should have
                      once uploaded to the device. len(names) == len(files)
        :param metadata: If not None, it is a list of :class:`Metadata` objects.
                         The idea is to use the metadata to determine where on the device to
                         put the book. len(metadata) == len(files). Apart from the regular
                         cover (path to cover), there may also be a thumbnail attribute, which should
                         be used in preference. The thumbnail attribute is of the form
                         (width, height, cover_data as jpeg).
        :return: A list of 3-element tuples. The list is meant to be passed
                 to :meth:`add_books_to_metadata`.
        '''
        raise NotImplementedError()
    @classmethod
    def add_books_to_metadata(cls, locations, metadata, booklists):
        '''
        Add locations to the booklists. This function must not communicate with
        the device.
        :param locations: Result of a call to :meth:`upload_books`
        :param metadata: List of :class:`Metadata` objects, same as for
                         :meth:`upload_books`.
        :param booklists: A tuple containing the result of calls to
                          (:meth:`books(oncard=None)`,
                          :meth:`books(oncard='carda')`,
                          :meth:`books(oncard='cardb')`).
        '''
        raise NotImplementedError()
    def delete_books(self, paths, end_session=True):
        '''
        Delete books at paths on device.
        '''
        raise NotImplementedError()
    @classmethod
    def remove_books_from_metadata(cls, paths, booklists):
        '''
        Remove books from the metadata list. This function must not communicate
        with the device.
        :param paths: paths to books on the device.
        :param booklists: A tuple containing the result of calls to
                          (:meth:`books(oncard=None)`,
                          :meth:`books(oncard='carda')`,
                          :meth:`books(oncard='cardb')`).
        '''
        raise NotImplementedError()
    def sync_booklists(self, booklists, end_session=True):
        '''
        Update metadata on device.
        :param booklists: A tuple containing the result of calls to
                          (:meth:`books(oncard=None)`,
                          :meth:`books(oncard='carda')`,
                          :meth:`books(oncard='cardb')`).
        '''
        raise NotImplementedError()
    def get_file(self, path, outfile, end_session=True):
        '''
        Read the file at ``path`` on the device and write it to outfile.
        :param outfile: file object like ``sys.stdout`` or the result of an
                       :func:`open` call.
        '''
        raise NotImplementedError()
    @classmethod
    def config_widget(cls):
        '''
        Should return a QWidget. The QWidget contains the settings for the
        device interface
        '''
        raise NotImplementedError()
    @classmethod
    def save_settings(cls, settings_widget):
        '''
        Should save settings to disk. Takes the widget created in
        :meth:`config_widget` and saves all settings to disk.
        '''
        raise NotImplementedError()
    @classmethod
    def settings(cls):
        '''
        Should return an opts object. The opts object should have at least one
        attribute `format_map` which is an ordered list of formats for the
        device.
        '''
        raise NotImplementedError()
    def set_plugboards(self, plugboards, pb_func):
        '''
        Provide the driver with the current set of plugboards and a function to
        select a specific plugboard. This method is called immediately before
        add_books and sync_booklists.
        pb_func is a callable with the following signature::
            def pb_func(device_name, format, plugboards)
        You give it the current device name (either the class name or
        DEVICE_PLUGBOARD_NAME), the format you are interested in (a 'real'
        format or 'device_db'), and the plugboards (you were given those by
        set_plugboards, the same place you got this method).
        :return: None or a single plugboard instance.
        '''
        pass
    def set_driveinfo_name(self, location_code, name):
        '''
        Set the device name in the driveinfo file to 'name'. This setting will
        persist until the file is re-created or the name is changed again.
        Non-disk devices should implement this method based on the location
        codes returned by the get_device_information() method.
        '''
        pass
    def prepare_addable_books(self, paths):
        '''
        Given a list of paths, returns another list of paths. These paths
        point to addable versions of the books.
        If there is an error preparing a book, then instead of a path, the
        position in the returned list for that book should be a 3-tuple:
        (original_path, the exception instance, traceback)
        '''
        return paths
    def startup(self):
        '''
        Called when calibre is starting the device. Do any initialization
        required. Note that multiple instances of the class can be instantiated,
        and thus __init__ can be called multiple times, but only one instance
        will have this method called. This method is called on the device
        thread, not the GUI thread.
        '''
        pass
    def shutdown(self):
        '''
        Called when calibre is shutting down, either for good or in preparation
        to restart. Do any cleanup required. This method is called on the
        device thread, not the GUI thread.
        '''
        pass
    def get_device_uid(self):
        '''
        Must return a unique id for the currently connected device (this is
        called immediately after a successful call to open()). You must
        implement this method if you set ASK_TO_ALLOW_CONNECT = True
        '''
        raise NotImplementedError()
    def ignore_connected_device(self, uid):
        '''
        Should ignore the device identified by uid (the result of a call to
        get_device_uid()) in the future. You must implement this method if you
        set ASK_TO_ALLOW_CONNECT = True. Note that this function is called
        immediately after open(), so if open() caches some state, the driver
        should reset that state.
        '''
        raise NotImplementedError()
    def get_user_blacklisted_devices(self):
        '''
        Return map of device uid to friendly name for all devices that the user
        has asked to be ignored.
        '''
        return {}
    def set_user_blacklisted_devices(self, devices):
        '''
        Set the list of device uids that should be ignored by this driver.
        '''
        pass
    def specialize_global_preferences(self, device_prefs):
        '''
        Implement this method if your device wants to override a particular
        preference. You must ensure that all call sites that want a preference
        that can be overridden use device_prefs['something'] instead of
        prefs['something']. Your method should call
        device_prefs.set_overrides(pref=val, pref=val, ...).
        Currently used for:
        metadata management (prefs['manage_device_metadata'])
        '''
        device_prefs.set_overrides()
    def set_library_info(self, library_name, library_uuid, field_metadata):
        '''
        Implement this method if you want information about the current calibre
        library. This method is called at startup and when the calibre library
        changes while connected.
        '''
        pass
    # Dynamic control interface.
    # The following methods are probably called on the GUI thread. Any driver
    # that implements these methods must take pains to be thread safe, because
    # the device_manager might be using the driver at the same time that one of
    # these methods is called.
    def is_dynamically_controllable(self):
        '''
        Called by the device manager when starting plugins. If this method returns
        a string, then a) it supports the device manager's dynamic control
        interface, and b) that name is to be used when talking to the plugin.
        This method can be called on the GUI thread. A driver that implements
        this method must be thread safe.
        '''
        return None
    def start_plugin(self):
        '''
        This method is called to start the plugin. The plugin should begin
        to accept device connections however it does that. If the plugin is
        already accepting connections, then do nothing.
        This method can be called on the GUI thread. A driver that implements
        this method must be thread safe.
        '''
        pass
    def stop_plugin(self):
        '''
        This method is called to stop the plugin. The plugin should no longer
        accept connections, and should clean up after itself. It is likely that
        this method should call shutdown. If the plugin is already not accepting
        connections, then do nothing.
        This method can be called on the GUI thread. A driver that implements
        this method must be thread safe.
        '''
        pass
    def get_option(self, opt_string, default=None):
        '''
        Return the value of the option indicated by opt_string. This method can
        be called when the plugin is not started. Return None if the option does
        not exist.
        This method can be called on the GUI thread. A driver that implements
        this method must be thread safe.
        '''
        return default
    def set_option(self, opt_string, opt_value):
        '''
        Set the value of the option indicated by opt_string. This method can
        be called when the plugin is not started.
        This method can be called on the GUI thread. A driver that implements
        this method must be thread safe.
        '''
        pass
    def is_running(self):
        '''
        Return True if the plugin is started, otherwise False.
        This method can be called on the GUI thread. A driver that implements
        this method must be thread safe.
        '''
        return False
    def synchronize_with_db(self, db, book_id, book_metadata, first_call):
        '''
        Called during book matching when a book on the device is matched with
        a book in calibre's db. The method is responsible for synchronizing
        data from the device to calibre's db (if needed).
        The method must return a two-value tuple. The first value is a set of
        calibre book ids changed if calibre's database was changed or None if the
        database was not changed. If the first value is an empty set then the
        metadata for the book on the device is updated with calibre's metadata
        and given back to the device, but no GUI refresh of that book is done.
        This is useful when the calibre data is correct but must be sent to the
        device.
        The second value is itself a 2-value tuple. The first value in the tuple
        specifies whether a book format should be sent to the device. The intent
        is to permit verifying that the book on the device is the same as the
        book in calibre. This value must be None if no book is to be sent,
        otherwise return the base file name on the device (a string like
        foobar.epub). Be sure to include the extension in the name. The device
        subsystem will construct a send_books job for all books with non-None
        returned values. Note: other than to later retrieve the extension, the
        name is ignored in cases where the device uses a template to generate
        the file name, which most do. The second value in the returned tuple
        indicates whether the format is future-dated. Return True if it is,
        otherwise return False. calibre will display a dialog to the user
        listing all future dated books.
        Extremely important: this method is called on the GUI thread. It must
        be threadsafe with respect to the device manager's thread.
        book_id: the calibre id for the book in the database.
        book_metadata: the Metadata object for the book coming from the device.
        first_call: True if this is the first call during a sync, False otherwise
        '''
        return (None, (None, False))
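 # A minimal sketch of the return value described in the synchronize_with_db
 # docstring above. The helper name sync_result and its parameters are
 # hypothetical and exist only for illustration; a real driver would build the
 # tuple inside its own synchronize_with_db implementation.

```python
def sync_result(changed_ids=None, device_file_name=None, future_dated=False):
    """Assemble the (changed ids, (file name, future-dated)) tuple.

    changed_ids: None if calibre's database was not changed, otherwise an
        iterable of calibre book ids (an empty one means "push calibre's
        metadata to the device, but do not refresh the GUI").
    device_file_name: None if no format should be sent, otherwise the base
        file name on the device including its extension, e.g. 'foobar.epub'.
    future_dated: True if the format on the device is future-dated.
    """
    if changed_ids is not None:
        changed_ids = set(changed_ids)
    return (changed_ids, (device_file_name, future_dated))


# Nothing changed and nothing to send: same as the default implementation.
unchanged = sync_result()
# Book 3 was updated in calibre's db and foobar.epub should be re-sent.
resend = sync_result([3], 'foobar.epub')
```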
 class BookList(list):
    '''
    A list of books. Each Book object must have the fields
      #. title
      #. authors
      #. size (file size of the book)
      #. datetime (a UTC time tuple)
      #. path (path on the device to the book)
      #. thumbnail (can be None) thumbnail is either a str/bytes object with the
         image data or it should have an attribute image_path that stores an
         absolute (platform native) path to the image
      #. tags (a list of strings, can be empty).
    '''
    __getslice__ = None
    __setslice__ = None
    def __init__(self, oncard, prefix, settings):
        pass
    def supports_collections(self):
        ''' Return True if the device supports collections for this book list. '''
        raise NotImplementedError()
    def add_book(self, book, replace_metadata):
        '''
        Add the book to the booklist. Intent is to maintain any device-internal
        metadata. Return True if booklists must be sync'ed
        '''
        raise NotImplementedError()
    def remove_book(self, book):
        '''
        Remove a book from the booklist. Correct any device metadata at the
        same time
        '''
        raise NotImplementedError()
    def get_collections(self, collection_attributes):
        '''
        Return a dictionary of collections created from collection_attributes.
        Each entry in the dictionary is of the form collection name:[list of
        books]
        The list of books is sorted by book title, except for collections
        created from series, in which case series_index is used.
        :param collection_attributes: A list of attributes of the Book object
        '''
        raise NotImplementedError()
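 # The BookList interface above could be satisfied by something like the
 # following in-memory sketch. Book and InMemoryBookList are hypothetical names
 # that are not part of this module, and the sketch sorts every collection by
 # title only (it does not implement the series_index ordering mentioned in
 # get_collections).

```python
from collections import defaultdict


class Book:
    # Hypothetical stand-in carrying only the fields this sketch uses.
    def __init__(self, title, path, tags=()):
        self.title = title
        self.path = path
        self.tags = list(tags)


class InMemoryBookList(list):
    def __init__(self, oncard=None, prefix='', settings=None):
        super().__init__()
        self.oncard = oncard

    def supports_collections(self):
        return True

    def add_book(self, book, replace_metadata):
        # Books are identified by their path on the device.
        for i, existing in enumerate(self):
            if existing.path == book.path:
                if replace_metadata:
                    self[i] = book
                    return True
                return False
        self.append(book)
        return True  # booklists must be synced

    def remove_book(self, book):
        self[:] = [b for b in self if b.path != book.path]

    def get_collections(self, collection_attributes):
        collections = defaultdict(list)
        for book in self:
            for attr in collection_attributes:
                values = getattr(book, attr, None) or []
                if isinstance(values, str):
                    values = [values]
                for value in values:
                    collections[value].append(book)
        return {name: sorted(books, key=lambda b: b.title)
                for name, books in collections.items()}
```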
 class CurrentlyConnectedDevice(object):
    def __init__(self):
        self._device = None
    @property
    def device(self):
        return self._device
 # A device driver can check if a device is currently connected to calibre using
 # the following code::
 #   from calibre.devices.interface import currently_connected_device
 #   if currently_connected_device.device is None:
 #      # no device connected
 # The device attribute will be either None or the device driver object
 # (DevicePlugin instance) for the currently connected device.
 currently_connected_device = CurrentlyConnectedDevice()
--- a/ebook_converter/ebooks/BeautifulSoup.py
+++ b/ebook_converter/ebooks/BeautifulSoup.py
@@ -0,0 +1,41 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 # License: GPLv3 Copyright: 2019, Kovid Goyal <kovid at kovidgoyal.net>
 from __future__ import absolute_import, division, print_function, unicode_literals
 import bs4
 from bs4 import (  # noqa
    CData, Comment, Declaration, NavigableString, ProcessingInstruction,
    SoupStrainer, Tag, __version__
 )
 from polyglot.builtins import unicode_type
 def parse_html(markup):
    from calibre.ebooks.chardet import strip_encoding_declarations, xml_to_unicode, substitute_entites
    from calibre.utils.cleantext import clean_xml_chars
    if isinstance(markup, unicode_type):
        markup = strip_encoding_declarations(markup)
        markup = substitute_entites(markup)
    else:
        markup = xml_to_unicode(markup, strip_encoding_pats=True, resolve_entities=True)[0]
    markup = clean_xml_chars(markup)
    from html5_parser.soup import parse
    return parse(markup, return_root=False)
 def prettify(soup):
    ans = soup.prettify()
    if isinstance(ans, bytes):
        ans = ans.decode('utf-8')
    return ans
 def BeautifulSoup(markup='', *a, **kw):
    return parse_html(markup)
 def BeautifulStoneSoup(markup='', *a, **kw):
    return bs4.BeautifulSoup(markup, 'xml')
--- a/ebook_converter/ebooks/__init__.py
+++ b/ebook_converter/ebooks/__init__.py
@@ -0,0 +1,248 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 '''
 Code for the conversion of ebook formats and the reading of metadata
 from various formats.
 '''
 import os, re, numbers, sys
 from calibre import prints
 from calibre.ebooks.chardet import xml_to_unicode
 from polyglot.builtins import unicode_type
 class ConversionError(Exception):
    def __init__(self, msg, only_msg=False):
        Exception.__init__(self, msg)
        self.only_msg = only_msg
 class UnknownFormatError(Exception):
    pass
 class DRMError(ValueError):
    pass
 class ParserError(ValueError):
    pass
 BOOK_EXTENSIONS = ['lrf', 'rar', 'zip', 'rtf', 'lit', 'txt', 'txtz', 'text', 'htm', 'xhtm',
                   'html', 'htmlz', 'xhtml', 'pdf', 'pdb', 'updb', 'pdr', 'prc', 'mobi', 'azw', 'doc',
                   'epub', 'fb2', 'fbz', 'djv', 'djvu', 'lrx', 'cbr', 'cbz', 'cbc', 'oebzip',
                   'rb', 'imp', 'odt', 'chm', 'tpz', 'azw1', 'pml', 'pmlz', 'mbp', 'tan', 'snb',
                   'xps', 'oxps', 'azw4', 'book', 'zbf', 'pobi', 'docx', 'docm', 'md',
                   'textile', 'markdown', 'ibook', 'ibooks', 'iba', 'azw3', 'ps', 'kepub', 'kfx', 'kpf']
 def return_raster_image(path):
    from calibre.utils.imghdr import what
    if os.access(path, os.R_OK):
        with open(path, 'rb') as f:
            raw = f.read()
        if what(None, raw) not in (None, 'svg'):
            return raw
 def extract_cover_from_embedded_svg(html, base, log):
    from calibre.ebooks.oeb.base import XPath, SVG, XLINK
    from calibre.utils.xml_parse import safe_xml_fromstring
    root = safe_xml_fromstring(html)
    svg = XPath('//svg:svg')(root)
    if len(svg) == 1 and len(svg[0]) == 1 and svg[0][0].tag == SVG('image'):
        image = svg[0][0]
        href = image.get(XLINK('href'), None)
        if href:
            path = os.path.join(base, *href.split('/'))
            return return_raster_image(path)
 def extract_calibre_cover(raw, base, log):
    from calibre.ebooks.BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup(raw)
    matches = soup.find(name=['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'span',
        'font', 'br'])
    images = soup.findAll('img', src=True)
    if matches is None and len(images) == 1 and \
            images[0].get('alt', '').lower()=='cover':
        img = images[0]
        img = os.path.join(base, *img['src'].split('/'))
        q = return_raster_image(img)
        if q is not None:
            return q
    # Look for a simple cover, i.e. a body with no text and only one <img> tag
    if matches is None:
        body = soup.find('body')
        if body is not None:
            text = u''.join(map(unicode_type, body.findAll(text=True)))
            if text.strip():
                # Body has text, abort
                return
            images = body.findAll('img', src=True)
            if len(images) == 1:
                img = os.path.join(base, *images[0]['src'].split('/'))
                return return_raster_image(img)
 def render_html_svg_workaround(path_to_html, log, width=590, height=750):
    from calibre.ebooks.oeb.base import SVG_NS
    with open(path_to_html, 'rb') as f:
        raw = f.read()
    raw = xml_to_unicode(raw, strip_encoding_pats=True)[0]
    data = None
    if SVG_NS in raw:
        try:
            data = extract_cover_from_embedded_svg(raw,
                   os.path.dirname(path_to_html), log)
        except Exception:
            pass
    if data is None:
        try:
            data = extract_calibre_cover(raw, os.path.dirname(path_to_html), log)
        except Exception:
            pass
    if data is None:
        data = render_html_data(path_to_html, width, height)
    return data
 def render_html_data(path_to_html, width, height):
    from calibre.ptempfile import TemporaryDirectory
    from calibre.utils.ipc.simple_worker import fork_job, WorkerError
    result = {}
    def report_error(text=''):
        prints('Failed to render', path_to_html, 'with errors:', file=sys.stderr)
        if text:
            prints(text, file=sys.stderr)
        if result and result['stdout_stderr']:
            with open(result['stdout_stderr'], 'rb') as f:
                prints(f.read(), file=sys.stderr)
    with TemporaryDirectory('-render-html') as tdir:
        try:
            result = fork_job('calibre.ebooks.render_html', 'main', args=(path_to_html, tdir, 'jpeg'))
        except WorkerError as e:
            report_error(e.orig_tb)
        else:
            if result['result']:
                with open(os.path.join(tdir, 'rendered.jpeg'), 'rb') as f:
                    return f.read()
            else:
                report_error()
 def check_ebook_format(stream, current_guess):
    ans = current_guess
    if current_guess.lower() in ('prc', 'mobi', 'azw', 'azw1', 'azw3'):
        stream.seek(0)
        if stream.read(3) == b'TPZ':
            ans = 'tpz'
        stream.seek(0)
    return ans
 def normalize(x):
    if isinstance(x, unicode_type):
        import unicodedata
        x = unicodedata.normalize('NFC', x)
    return x
 def calibre_cover(title, author_string, series_string=None,
        output_format='jpg', title_size=46, author_size=36, logo_path=None):
    title = normalize(title)
    author_string = normalize(author_string)
    series_string = normalize(series_string)
    from calibre.ebooks.covers import calibre_cover2
    from calibre.utils.img import image_to_data
    ans = calibre_cover2(title, author_string or '', series_string or '', logo_path=logo_path, as_qimage=True)
    return image_to_data(ans, fmt=output_format)
 UNIT_RE = re.compile(r'^(-*[0-9]*[.]?[0-9]*)\s*(%|em|ex|en|px|mm|cm|in|pt|pc|rem|q)$')
 def unit_convert(value, base, font, dpi, body_font_size=12):
    ' Return value in pts'
    if isinstance(value, numbers.Number):
        return value
    try:
        return float(value) * 72.0 / dpi
    except Exception:
        pass
    result = value
    m = UNIT_RE.match(value)
    if m is not None and m.group(1):
        value = float(m.group(1))
        unit = m.group(2)
        if unit == '%':
            result = (value / 100.0) * base
        elif unit == 'px':
            result = value * 72.0 / dpi
        elif unit == 'in':
            result = value * 72.0
        elif unit == 'pt':
            result = value
        elif unit == 'em':
            result = value * font
        elif unit in ('ex', 'en'):
            # This is a hack for ex since we have no way to know
            # the x-height of the font
            result = value * font * 0.5
        elif unit == 'pc':
            result = value * 12.0
        elif unit == 'mm':
            result = value * 2.8346456693
        elif unit == 'cm':
            result = value * 28.346456693
        elif unit == 'rem':
            result = value * body_font_size
        elif unit == 'q':
            result = value * 0.708661417325
    return result
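 # The conversion factors used by unit_convert above follow from 1in = 72pt =
 # 25.4mm = 6pc and 1q = 0.25mm. The table-driven sketch below restates that
 # arithmetic; to_points, LENGTH_RE and ABSOLUTE_PT are hypothetical names, and
 # LENGTH_RE is deliberately stricter than UNIT_RE (it requires at least one
 # digit), so this is an illustration rather than a drop-in replacement.

```python
import re

LENGTH_RE = re.compile(
    r'^(-?(?:[0-9]+[.]?[0-9]*|[.][0-9]+))\s*'
    r'(%|em|ex|en|px|mm|cm|in|pt|pc|rem|q)$')

# Points per unit for the absolute CSS units.
ABSOLUTE_PT = {
    'pt': 1.0,
    'in': 72.0,
    'pc': 12.0,
    'mm': 72.0 / 25.4,
    'cm': 72.0 / 2.54,
    'q': 72.0 / 101.6,  # a quarter of a millimetre
}


def to_points(value, base, font, dpi, body_font_size=12):
    m = LENGTH_RE.match(value)
    if m is None:
        return value  # not a recognised length, hand it back unchanged
    number, unit = float(m.group(1)), m.group(2)
    if unit in ABSOLUTE_PT:
        return number * ABSOLUTE_PT[unit]
    if unit == '%':
        return number / 100.0 * base
    if unit == 'px':
        return number * 72.0 / dpi
    if unit == 'em':
        return number * font
    if unit in ('ex', 'en'):
        return number * font * 0.5  # same half-em approximation as above
    return number * body_font_size  # only 'rem' is left
```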
 def parse_css_length(value):
    try:
        m = UNIT_RE.match(value)
    except TypeError:
        return None, None
    if m is not None and m.group(1):
        value = float(m.group(1))
        unit = m.group(2)
        return value, unit.lower()
    return None, None
 def generate_masthead(title, output_path=None, width=600, height=60):
    from calibre.ebooks.conversion.config import load_defaults
    recs = load_defaults('mobi_output')
    masthead_font_family = recs.get('masthead_font', None)
    from calibre.ebooks.covers import generate_masthead
    return generate_masthead(title, output_path=output_path, width=width, height=height, font_family=masthead_font_family)
 def escape_xpath_attr(value):
    if '"' in value:
        if "'" in value:
            parts = re.split('("+)', value)
            ans = []
            for x in parts:
                if x:
                    q = "'" if '"' in x else '"'
                    ans.append(q + x + q)
            return 'concat(%s)' % ', '.join(ans)
        else:
            return "'%s'" % value
    return '"%s"' % value
--- a/ebook_converter/ebooks/chardet.py
+++ b/ebook_converter/ebooks/chardet.py
@@ -0,0 +1,189 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import re, codecs
 from polyglot.builtins import unicode_type
 _encoding_pats = (
    # XML declaration
    r'<\?[^<>]+encoding\s*=\s*[\'"](.*?)[\'"][^<>]*>',
    # HTML 5 charset
    r'''<meta\s+charset=['"]([-_a-z0-9]+)['"][^<>]*>(?:\s*</meta>){0,1}''',
    # HTML 4 Pragma directive
    r'''<meta\s+?[^<>]*?content\s*=\s*['"][^'"]*?charset=([-_a-z0-9]+)[^'"]*?['"][^<>]*>(?:\s*</meta>){0,1}''',
 )
 def compile_pats(binary):
    for raw in _encoding_pats:
        if binary:
            raw = raw.encode('ascii')
        yield re.compile(raw, flags=re.IGNORECASE)
 class LazyEncodingPats(object):
    def __call__(self, binary=False):
        attr = 'binary_pats' if binary else 'unicode_pats'
        pats = getattr(self, attr, None)
        if pats is None:
            pats = tuple(compile_pats(binary))
            setattr(self, attr, pats)
        for pat in pats:
            yield pat
 lazy_encoding_pats = LazyEncodingPats()
 ENTITY_PATTERN = re.compile(r'&(\S+?);')
 def strip_encoding_declarations(raw, limit=50*1024, preserve_newlines=False):
    prefix = raw[:limit]
    suffix = raw[limit:]
    is_binary = isinstance(raw, bytes)
    if preserve_newlines:
        if is_binary:
            sub = lambda m: b'\n' * m.group().count(b'\n')
        else:
            sub = lambda m: '\n' * m.group().count('\n')
    else:
        sub = b'' if is_binary else u''
    for pat in lazy_encoding_pats(is_binary):
        prefix = pat.sub(sub, prefix)
    raw = prefix + suffix
    return raw
 def replace_encoding_declarations(raw, enc='utf-8', limit=50*1024):
    prefix = raw[:limit]
    suffix = raw[limit:]
    changed = [False]
    is_binary = isinstance(raw, bytes)
    if is_binary:
        if not isinstance(enc, bytes):
            enc = enc.encode('ascii')
    else:
        if isinstance(enc, bytes):
            enc = enc.decode('ascii')
    def sub(m):
        ans = m.group()
        if m.group(1).lower() != enc.lower():
            changed[0] = True
            start, end = m.start(1) - m.start(0), m.end(1) - m.end(0)
            ans = ans[:start] + enc + ans[end:]
        return ans
    for pat in lazy_encoding_pats(is_binary):
        prefix = pat.sub(sub, prefix)
    raw = prefix + suffix
    return raw, changed[0]
 def find_declared_encoding(raw, limit=50*1024):
    prefix = raw[:limit]
    is_binary = isinstance(raw, bytes)
    for pat in lazy_encoding_pats(is_binary):
        m = pat.search(prefix)
        if m is not None:
            ans = m.group(1)
            if is_binary:
                ans = ans.decode('ascii', 'replace')
            # Return the match for both bytes and unicode input
            return ans
 def substitute_entites(raw):
    from calibre import xml_entity_to_unicode
    return ENTITY_PATTERN.sub(xml_entity_to_unicode, raw)
 _CHARSET_ALIASES = {"macintosh" : "mac-roman",
                        "x-sjis" : "shift-jis"}
 def detect(*args, **kwargs):
    from chardet import detect
    return detect(*args, **kwargs)
 def force_encoding(raw, verbose, assume_utf8=False):
    from calibre.constants import preferred_encoding
    try:
        chardet = detect(raw[:1024*50])
    except Exception:
        chardet = {'encoding':preferred_encoding, 'confidence':0}
    encoding = chardet['encoding']
    if chardet['confidence'] < 1 and assume_utf8:
        encoding = 'utf-8'
    if chardet['confidence'] < 1 and verbose:
        print('WARNING: Encoding detection confidence for %s is %d%%'%(
            chardet['encoding'], chardet['confidence']*100))
    if not encoding:
        encoding = preferred_encoding
    encoding = encoding.lower()
    encoding = _CHARSET_ALIASES.get(encoding, encoding)
    if encoding == 'ascii':
        encoding = 'utf-8'
    return encoding
 def detect_xml_encoding(raw, verbose=False, assume_utf8=False):
    if not raw or isinstance(raw, unicode_type):
        return raw, None
    for x in ('utf8', 'utf-16-le', 'utf-16-be'):
        bom = getattr(codecs, 'BOM_'+x.upper().replace('-16', '16').replace(
            '-', '_'))
        if raw.startswith(bom):
            return raw[len(bom):], x
    encoding = None
    for pat in lazy_encoding_pats(True):
        match = pat.search(raw)
        if match:
            encoding = match.group(1)
            encoding = encoding.decode('ascii', 'replace')
            break
    if encoding is None:
        encoding = force_encoding(raw, verbose, assume_utf8=assume_utf8)
    if encoding.lower().strip() == 'macintosh':
        encoding = 'mac-roman'
    if encoding.lower().replace('_', '-').strip() in (
            'gb2312', 'chinese', 'csiso58gb231280', 'euc-cn', 'euccn',
            'eucgb2312-cn', 'gb2312-1980', 'gb2312-80', 'iso-ir-58'):
        # Microsoft Word exports to HTML with encoding incorrectly set to
        # gb2312 instead of gbk. gbk is a superset of gb2312, anyway.
        encoding = 'gbk'
    try:
        codecs.lookup(encoding)
    except LookupError:
        encoding = 'utf-8'
    return raw, encoding
 def xml_to_unicode(raw, verbose=False, strip_encoding_pats=False,
                   resolve_entities=False, assume_utf8=False):
    '''
    Force conversion of a byte string to unicode. Looks for an XML/HTML
    encoding declaration first; if none is found, uses the chardet library
    and, when verbose is set, prints a warning if detection confidence is
    < 100%.
    :return: (unicode, encoding used)
    '''
    if not raw:
        return '', None
    raw, encoding = detect_xml_encoding(raw, verbose=verbose,
            assume_utf8=assume_utf8)
    if not isinstance(raw, unicode_type):
        raw = raw.decode(encoding, 'replace')
    if strip_encoding_pats:
        raw = strip_encoding_declarations(raw)
    if resolve_entities:
        raw = substitute_entites(raw)
    return raw, encoding
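 # The detection order used above (byte order mark, then a declared encoding,
 # then a fallback) can be sketched with the standard library alone. This is a
 # simplified illustration, not the module's implementation: sniff_and_decode,
 # BOMS and XML_DECL are hypothetical names, only the XML declaration pattern
 # is checked, and the chardet step is replaced by a fixed default.

```python
import codecs
import re

BOMS = (
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
)

# Bytes version of the XML declaration pattern from _encoding_pats.
XML_DECL = re.compile(
    rb'''<\?[^<>]+encoding\s*=\s*['"](.*?)['"][^<>]*>''', re.IGNORECASE)


def sniff_and_decode(raw, default='utf-8', limit=50 * 1024):
    # 1. A byte order mark wins and is stripped from the data.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(name, 'replace'), name
    # 2. Otherwise look for a declared encoding near the start of the data.
    m = XML_DECL.search(raw[:limit])
    encoding = m.group(1).decode('ascii', 'replace') if m else default
    # 3. Fall back to the default if Python has no codec by that name.
    try:
        codecs.lookup(encoding)
    except LookupError:
        encoding = default
    return raw.decode(encoding, 'replace'), encoding
```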
--- a/ebook_converter/ebooks/compression/__init__.py
+++ b/ebook_converter/ebooks/compression/__init__.py
@@ -0,0 +1,6 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
--- a/ebook_converter/ebooks/compression/palmdoc.c
+++ b/ebook_converter/ebooks/compression/palmdoc.c
@@ -0,0 +1,238 @@
 /*
 :mod:`cPalmdoc` -- Palmdoc compression/decompression
 =====================================================
 .. module:: cPalmdoc
    :platform: All
    :synopsis: Compression and decompression of Palmdoc, implemented in C for speed
 .. moduleauthor:: Kovid Goyal <kovid@kovidgoyal.net> Copyright 2009
 */
 #define PY_SSIZE_T_CLEAN
 #include <Python.h>
 #include <stdio.h>
 #define BUFFER 6000
 #define MIN(x, y) ( ((x) < (y)) ? (x) : (y) )
 #define MAX(x, y) ( ((x) > (y)) ? (x) : (y) )
 typedef unsigned short int Byte;
 typedef struct {
 	Byte	*data;
 	Py_ssize_t len;
 } buffer;
 #ifdef	bool
 #undef	bool
 #endif
 #define	bool		int
 #ifdef	false
 #undef	false
 #endif
 #define	false		0
 #ifdef	true
 #undef	true
 #endif
 #define	true		1
 #define CHAR(x) (( (x) > 127 ) ? (x)-256 : (x))
 #if PY_MAJOR_VERSION >= 3
    #define BUFFER_FMT "y#"
    #define BYTES_FMT "y#"
 #else
    #define BUFFER_FMT "t#"
    #define BYTES_FMT "s#"
 #endif
 static PyObject *
 cpalmdoc_decompress(PyObject *self, PyObject *args) {
    const char *_input = NULL; Py_ssize_t input_len = 0;
    Byte *input; char *output; Byte c; PyObject *ans;
    Py_ssize_t i = 0, o = 0, j = 0, di, n;
    if (!PyArg_ParseTuple(args, BUFFER_FMT, &_input, &input_len))
 		return NULL;
    input = (Byte *) PyMem_Malloc(sizeof(Byte)*input_len);
    if (input == NULL) return PyErr_NoMemory();
    // Map chars to bytes
    for (j = 0; j < input_len; j++)
        input[j] = (_input[j] < 0) ? _input[j]+256 : _input[j];
    output = (char *)PyMem_Malloc(sizeof(char)*(MAX(BUFFER, 8*input_len)));
    // Free the input copy before bailing out so it is not leaked
    if (output == NULL) { PyMem_Free(input); return PyErr_NoMemory(); }
    while (i < input_len) {
        c = input[i++];
        if (c >= 1 && c <= 8)  // copy 'c' bytes
            while (c--) output[o++] = (char)input[i++];
        else if (c <= 0x7F)  // 0, 09-7F = self
            output[o++] = (char)c;
        else if (c >= 0xC0) { // space + ASCII char
            output[o++] = ' ';
            output[o++] = c ^ 0x80;
        }
        else { // 80-BF repeat sequences
            c = (c << 8) + input[i++];
            di = (c & 0x3FFF) >> 3;
            for ( n = (c & 7) + 3; n--; ++o )
                output[o] = output[o - di];
        }
    }
    ans = Py_BuildValue(BYTES_FMT, output, o);
    if (output != NULL) PyMem_Free(output);
    if (input != NULL) PyMem_Free(input);
    return ans;
 }
 static bool
 cpalmdoc_memcmp( Byte *a, Byte *b, Py_ssize_t len) {
    Py_ssize_t i;
    for (i = 0; i < len; i++) if (a[i] != b[i]) return false;
    return true;
 }
 static Py_ssize_t
 cpalmdoc_rfind(Byte *data, Py_ssize_t pos, Py_ssize_t chunk_length) {
    Py_ssize_t i;
    for (i = pos - chunk_length; i > -1; i--)
        if (cpalmdoc_memcmp(data+i, data+pos, chunk_length)) return i;
    return pos;
 }
 static Py_ssize_t
 cpalmdoc_do_compress(buffer *b, char *output) {
    Py_ssize_t i = 0, j, chunk_len, dist;
    unsigned int compound;
    Byte c, n;
    bool found;
    char *head;
    buffer temp;
    head = output;
    temp.data = (Byte *)PyMem_Malloc(sizeof(Byte)*8); temp.len = 0;
    if (temp.data == NULL) return 0;
    while (i < b->len) {
        c = b->data[i];
        //do repeats
        if ( i > 10 && (b->len - i) > 10) {
            found = false;
            for (chunk_len = 10; chunk_len > 2; chunk_len--) {
                j = cpalmdoc_rfind(b->data, i, chunk_len);
                dist = i - j;
                if (j < i && dist <= 2047) {
                    found = true;
                    compound = (unsigned int)((dist << 3) + chunk_len-3);
                    *(output++) = CHAR(0x80 + (compound >> 8 ));
                    *(output++) = CHAR(compound & 0xFF);
                    i += chunk_len;
                    break;
                }
            }
            if (found) continue;
        }
        //write single character
        i++;
        if (c == 32 && i < b->len) {
            n = b->data[i];
            if ( n >= 0x40 && n <= 0x7F) {
                *(output++) = CHAR(n^0x80); i++; continue;
            }
        }
        if (c == 0 || (c > 8 && c < 0x80))
            *(output++) = CHAR(c);
        else { // Write binary data
            j = i;
            temp.data[0] = c; temp.len = 1;
            while (j < b->len && temp.len < 8) {
                c = b->data[j];
                if (c == 0 || (c > 8 && c < 0x80)) break;
                temp.data[temp.len++] = c; j++;
            }
            i += temp.len - 1;
            *(output++) = (char)temp.len;
            for (j=0; j < temp.len; j++) *(output++) = (char)temp.data[j];
        }
    }
    PyMem_Free(temp.data);
    return output - head;
 }
 static PyObject *
 cpalmdoc_compress(PyObject *self, PyObject *args) {
    const char *_input = NULL; Py_ssize_t input_len = 0;
    char *output; PyObject *ans;
    Py_ssize_t j = 0;
    buffer b;
    if (!PyArg_ParseTuple(args, BUFFER_FMT, &_input, &input_len))
 		return NULL;
    b.data = (Byte *)PyMem_Malloc(sizeof(Byte)*input_len);
    if (b.data == NULL) return PyErr_NoMemory();
    // Map chars to bytes
    for (j = 0; j < input_len; j++)
        b.data[j] = (_input[j] < 0) ? _input[j]+256 : _input[j];
    b.len = input_len;
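    // Make the output buffer larger than the input as sometimes
    // compression results in a larger block. Each input byte yields at
    // most two output bytes (a length byte plus the literal), so 2x is safe.
    output = (char *)PyMem_Malloc(sizeof(char) * (size_t)(2*b.len));
    if (output == NULL) { PyMem_Free(b.data); return PyErr_NoMemory(); }
    j = cpalmdoc_do_compress(&b, output);
    if ( j == 0) { PyMem_Free(output); PyMem_Free(b.data); return PyErr_NoMemory(); }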
    ans = Py_BuildValue(BYTES_FMT, output, j);
    PyMem_Free(output);
    PyMem_Free(b.data);
    return ans;
 }
 static char cPalmdoc_doc[] = "Compress and decompress palmdoc strings.";
 static PyMethodDef cPalmdoc_methods[] = {
    {"decompress", cpalmdoc_decompress, METH_VARARGS,
    "decompress(bytestring) -> decompressed bytestring\n\n"
    		"Decompress a palmdoc compressed byte string. "
    },
    {"compress", cpalmdoc_compress, METH_VARARGS,
    "compress(bytestring) -> compressed bytestring\n\n"
    		"Palmdoc compress a byte string. "
    },
    {NULL, NULL, 0, NULL}
 };
 #if PY_MAJOR_VERSION >= 3
 #define INITERROR return NULL
 #define INITMODULE PyModule_Create(&cPalmdoc_module)
 static struct PyModuleDef cPalmdoc_module = {
    /* m_base     */ PyModuleDef_HEAD_INIT,
    /* m_name     */ "cPalmdoc",
    /* m_doc      */ cPalmdoc_doc,
    /* m_size     */ -1,
    /* m_methods  */ cPalmdoc_methods,
    /* m_slots    */ 0,
    /* m_traverse */ 0,
    /* m_clear    */ 0,
    /* m_free     */ 0,
 };
 CALIBRE_MODINIT_FUNC PyInit_cPalmdoc(void) {
 #else
 #define INITERROR return
 #define INITMODULE Py_InitModule3("cPalmdoc", cPalmdoc_methods, cPalmdoc_doc)
 CALIBRE_MODINIT_FUNC initcPalmdoc(void) {
 #endif
    PyObject *m;
    m = INITMODULE;
    if (m == NULL) {
        INITERROR;
    }
 #if PY_MAJOR_VERSION >= 3
    return m;
 #endif
 }
--- a/ebook_converter/ebooks/compression/palmdoc.py
+++ b/ebook_converter/ebooks/compression/palmdoc.py
@@ -0,0 +1,96 @@
 #!/usr/bin/env  python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 import io
 from struct import pack
 from calibre.constants import plugins
 from polyglot.builtins import range
 cPalmdoc = plugins['cPalmdoc'][0]
 if not cPalmdoc:
    raise RuntimeError(('Failed to load required cPalmdoc module: '
            '%s')%plugins['cPalmdoc'][1])
 def decompress_doc(data):
    return cPalmdoc.decompress(data)
 def compress_doc(data):
    return cPalmdoc.compress(data) if data else b''
 def py_compress_doc(data):
    out = io.BytesIO()
    i = 0
    ldata = len(data)
    while i < ldata:
        if i > 10 and (ldata - i) > 10:
            chunk = b''
            match = -1
            for j in range(10, 2, -1):
                chunk = data[i:i+j]
                try:
                    match = data.rindex(chunk, 0, i)
                except ValueError:
                    continue
                if (i - match) <= 2047:
                    break
                match = -1
            if match >= 0:
                n = len(chunk)
                m = i - match
                code = 0x8000 + ((m << 3) & 0x3ff8) + (n - 3)
                out.write(pack('>H', code))
                i += n
                continue
        ch = data[i:i+1]
        och = ord(ch)
        i += 1
        if ch == b' ' and (i + 1) < ldata:
            onch = ord(data[i:i+1])
            if onch >= 0x40 and onch < 0x80:
                out.write(pack('>B', onch ^ 0x80))
                i += 1
                continue
        if och == 0 or (och > 8 and och < 0x80):
            out.write(ch)
        else:
            j = i
            binseq = [ch]
            while j < ldata and len(binseq) < 8:
                ch = data[j:j+1]
                och = ord(ch)
                if och == 0 or (och > 8 and och < 0x80):
                    break
                binseq.append(ch)
                j += 1
            out.write(pack('>B', len(binseq)))
            out.write(b''.join(binseq))
            i += len(binseq) - 1
    return out.getvalue()
 def find_tests():
    import unittest
    class Test(unittest.TestCase):
        def test_palmdoc_compression(self):
            for test in [
                b'abc\x03\x04\x05\x06ms',  # Test binary writing
                b'a b c \xfed ',  # Test encoding of spaces
                b'0123456789axyz2bxyz2cdfgfo9iuyerh',
                b'0123456789asd0123456789asd|yyzzxxffhhjjkk',
                (b'ciewacnaq eiu743 r787q 0w%  ; sa fd\xef\ffdxosac wocjp acoiecowei '
                b'owaic jociowapjcivcjpoivjporeivjpoavca; p9aw8743y6r74%$^$^%8 ')
            ]:
                x = compress_doc(test)
                self.assertEqual(py_compress_doc(test), x)
                self.assertEqual(decompress_doc(x), test)
    return unittest.defaultTestLoader.loadTestsFromTestCase(Test)
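The module above ships a pure-Python compressor (py_compress_doc) but decompression goes through the C extension only. As a reading aid, the token rules of cpalmdoc_decompress can be sketched in Python 3 as follows. The name py_decompress_doc is hypothetical and not part of this commit, and the explicit ValueError checks are an addition that the C code does not perform.

```python
def py_decompress_doc(data):
    """Sketch of PalmDoc decompression, following cpalmdoc_decompress.

    Hypothetical helper for illustration only; expects a bytes object.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        i += 1
        if 1 <= c <= 8:
            # 0x01-0x08: the next c bytes are copied verbatim
            out += data[i:i + c]
            i += c
        elif c <= 0x7F:
            # 0x00 and 0x09-0x7F stand for themselves
            out.append(c)
        elif c >= 0xC0:
            # 0xC0-0xFF: a space followed by (c XOR 0x80)
            out.append(0x20)
            out.append(c ^ 0x80)
        else:
            # 0x80-0xBF: two-byte back-reference (distance, length)
            if i >= n:
                raise ValueError('truncated back-reference')
            code = (c << 8) | data[i]
            i += 1
            dist = (code & 0x3FFF) >> 3
            length = (code & 7) + 3
            if dist == 0 or dist > len(out):
                raise ValueError('invalid back-reference distance')
            # Copy byte by byte so overlapping references repeat correctly
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)
```

For example, the two bytes 0x80 0x18 after b'abc' encode distance 3 and length 3, so b'abc\x80\x18' expands to b'abcabc'.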
--- a/ebook_converter/ebooks/conversion/__init__.py
+++ b/ebook_converter/ebooks/conversion/__init__.py
@@ -0,0 +1,30 @@
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2011, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 from polyglot.builtins import native_string_type
 class ConversionUserFeedBack(Exception):
    def __init__(self, title, msg, level='info', det_msg=''):
        ''' Show a simple message to the user
        :param title: The title (very short description)
        :param msg: The message to show the user
        :param level: Must be one of 'info', 'warn' or 'error'
        :param det_msg: Optional detailed message to show the user
        '''
        import json
        Exception.__init__(self, json.dumps({'msg':msg, 'level':level,
            'det_msg':det_msg, 'title':title}))
        self.title, self.msg, self.det_msg = title, msg, det_msg
        self.level = level
 # Ensure exception uses fully qualified name as this is used to detect it in
 # the GUI.
 ConversionUserFeedBack.__name__ = native_string_type('calibre.ebooks.conversion.ConversionUserFeedBack')
--- a/ebook_converter/ebooks/conversion/cli.py
+++ b/ebook_converter/ebooks/conversion/cli.py
@@ -0,0 +1,428 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 '''
 Command line interface to conversion sub-system
 '''
 import sys, os, numbers
 from optparse import OptionGroup, Option
 from collections import OrderedDict
 from calibre.utils.config import OptionParser
 from calibre.utils.logging import Log
 from calibre.customize.conversion import OptionRecommendation
 from calibre import patheq
 from calibre.ebooks.conversion import ConversionUserFeedBack
 from calibre.utils.localization import localize_user_manual_link
 from polyglot.builtins import iteritems
 USAGE = '%prog ' + _('''\
 input_file output_file [options]
 Convert an e-book from one format to another.
 input_file is the input and output_file is the output. Both must be \
 specified as the first two arguments to the command.
 The output e-book format is guessed from the file extension of \
 output_file. output_file can also be of the special format .EXT where \
 EXT is the output file extension. In this case, the name of the output \
 file is derived from the name of the input file. Note that the filenames must \
 not start with a hyphen. Finally, if output_file has no extension, then \
 it is treated as a directory and an "open e-book" (OEB) consisting of HTML \
 files is written to that directory. These files are the files that would \
 normally have been passed to the output plugin.
 After specifying the input \
 and output file you can customize the conversion by specifying various \
 options. The available options depend on the input and output file types. \
 To get help on them specify the input and output file and then use the -h \
 option.
 For full documentation of the conversion system see
 ''') + localize_user_manual_link('https://manual.calibre-ebook.com/conversion.html')
 HEURISTIC_OPTIONS = ['markup_chapter_headings',
                      'italicize_common_cases', 'fix_indents',
                      'html_unwrap_factor', 'unwrap_lines',
                      'delete_blank_paragraphs', 'format_scene_breaks',
                      'dehyphenate', 'renumber_headings',
                      'replace_scene_breaks']
 DEFAULT_TRUE_OPTIONS = HEURISTIC_OPTIONS + ['remove_fake_margins']
 def print_help(parser, log):
    parser.print_help()
 def check_command_line_options(parser, args, log):
    if len(args) < 3 or args[1].startswith('-') or args[2].startswith('-'):
        print_help(parser, log)
        log.error('\n\nYou must specify the input AND output files')
        raise SystemExit(1)
    input = os.path.abspath(args[1])
    if not input.endswith('.recipe') and not os.access(input, os.R_OK) and not \
            ('-h' in args or '--help' in args):
        log.error('Cannot read from', input)
        raise SystemExit(1)
    if input.endswith('.recipe') and not os.access(input, os.R_OK):
        input = args[1]
    output = args[2]
    if (output.startswith('.') and output[:2] not in {'..', '.'} and '/' not in
            output and '\\' not in output):
        output = os.path.splitext(os.path.basename(input))[0]+output
    output = os.path.abspath(output)
    return input, output
 def option_recommendation_to_cli_option(add_option, rec):
    opt = rec.option
    switches = ['-'+opt.short_switch] if opt.short_switch else []
    switches.append('--'+opt.long_switch)
    attrs = dict(dest=opt.name, help=opt.help,
                     choices=opt.choices, default=rec.recommended_value)
    if isinstance(rec.recommended_value, type(True)):
        attrs['action'] = 'store_false' if rec.recommended_value else \
                          'store_true'
    else:
        if isinstance(rec.recommended_value, numbers.Integral):
            attrs['type'] = 'int'
        elif isinstance(rec.recommended_value, numbers.Real):
            attrs['type'] = 'float'
    if opt.long_switch == 'verbose':
        attrs['action'] = 'count'
        attrs.pop('type', '')
    if opt.name == 'read_metadata_from_opf':
        switches.append('--from-opf')
    if opt.name == 'transform_css_rules':
        attrs['help'] = _(
            'Path to a file containing rules to transform the CSS styles'
            ' in this book. The easiest way to create such a file is to'
            ' use the wizard for creating rules in the calibre GUI. Access'
            ' it in the "Look & feel->Transform styles" section of the conversion'
            ' dialog. Once you create the rules, you can use the "Export" button'
            ' to save them to a file.'
        )
    if opt.name in DEFAULT_TRUE_OPTIONS and rec.recommended_value is True:
        switches = ['--disable-'+opt.long_switch]
    add_option(Option(*switches, **attrs))
 def group_titles():
    return _('INPUT OPTIONS'), _('OUTPUT OPTIONS')
 def recipe_test(option, opt_str, value, parser):
    assert value is None
    value = []
    def floatable(s):
        try:
            float(s)
            return True
        except ValueError:
            return False
    for arg in parser.rargs:
        # stop on --foo like options
        if arg[:2] == "--":
            break
        # stop on -a, but not on -3 or -3.0
        if arg[:1] == "-" and len(arg) > 1 and not floatable(arg):
            break
        try:
            value.append(int(arg))
        except (TypeError, ValueError, AttributeError):
            break
        if len(value) == 2:
            break
    del parser.rargs[:len(value)]
    while len(value) < 2:
        value.append(2)
    setattr(parser.values, option.dest, tuple(value))
 def add_input_output_options(parser, plumber):
    input_options, output_options = \
                                plumber.input_options, plumber.output_options
    def add_options(group, options):
        for opt in options:
            if plumber.input_fmt == 'recipe' and opt.option.long_switch == 'test':
                group(Option('--test', dest='test', action='callback', callback=recipe_test))
            else:
                option_recommendation_to_cli_option(group, opt)
    if input_options:
        title = group_titles()[0]
        io = OptionGroup(parser, title, _('Options to control the processing'
                          ' of the input %s file')%plumber.input_fmt)
        add_options(io.add_option, input_options)
        parser.add_option_group(io)
    if output_options:
        title = group_titles()[1]
        oo = OptionGroup(parser, title, _('Options to control the processing'
                          ' of the output %s')%plumber.output_fmt)
        add_options(oo.add_option, output_options)
        parser.add_option_group(oo)
 def add_pipeline_options(parser, plumber):
    groups = OrderedDict((
              ('' , ('',
                    [
                     'input_profile',
                     'output_profile',
                     ]
                    )),
              (_('LOOK AND FEEL') , (
                  _('Options to control the look and feel of the output'),
                  [
                      'base_font_size', 'disable_font_rescaling',
                      'font_size_mapping', 'embed_font_family',
                      'subset_embedded_fonts', 'embed_all_fonts',
                      'line_height', 'minimum_line_height',
                      'linearize_tables',
                      'extra_css', 'filter_css', 'transform_css_rules', 'expand_css',
                      'smarten_punctuation', 'unsmarten_punctuation',
                      'margin_top', 'margin_left', 'margin_right',
                      'margin_bottom', 'change_justification',
                      'insert_blank_line', 'insert_blank_line_size',
                      'remove_paragraph_spacing',
                      'remove_paragraph_spacing_indent_size',
                      'asciiize', 'keep_ligatures',
                  ]
                  )),
              (_('HEURISTIC PROCESSING') , (
                  _('Modify the document text and structure using common'
                     ' patterns. Disabled by default. Use %(en)s to enable. '
                     ' Individual actions can be disabled with the %(dis)s options.')
                  % dict(en='--enable-heuristics', dis='--disable-*'),
                  ['enable_heuristics'] + HEURISTIC_OPTIONS
                  )),
              (_('SEARCH AND REPLACE') , (
                 _('Modify the document text and structure using user defined patterns.'),
                 [
                     'sr1_search', 'sr1_replace',
                     'sr2_search', 'sr2_replace',
                     'sr3_search', 'sr3_replace',
                     'search_replace',
                 ]
              )),
              (_('STRUCTURE DETECTION') , (
                  _('Control auto-detection of document structure.'),
                  [
                      'chapter', 'chapter_mark',
                      'prefer_metadata_cover', 'remove_first_image',
                      'insert_metadata', 'page_breaks_before',
                      'remove_fake_margins', 'start_reading_at',
                  ]
                  )),
              (_('TABLE OF CONTENTS') , (
                  _('Control the automatic generation of a Table of Contents. By '
                  'default, if the source file has a Table of Contents, it will '
                  'be used in preference to the automatically generated one.'),
                  [
                    'level1_toc', 'level2_toc', 'level3_toc',
                    'toc_threshold', 'max_toc_links', 'no_chapters_in_toc',
                    'use_auto_toc', 'toc_filter', 'duplicate_links_in_toc',
                  ]
                  )),
              (_('METADATA') , (_('Options to set metadata in the output'),
                            plumber.metadata_option_names + ['read_metadata_from_opf'],
                            )),
              (_('DEBUG'), (_('Options to help with debugging the conversion'),
                        [
                         'verbose',
                         'debug_pipeline',
                         ])),
              ))
    for group, (desc, options) in iteritems(groups):
        if group:
            group = OptionGroup(parser, group, desc)
            parser.add_option_group(group)
        add_option = group.add_option if group != '' else parser.add_option
        for name in options:
            rec = plumber.get_option_by_name(name)
            if rec.level < rec.HIGH:
                option_recommendation_to_cli_option(add_option, rec)
 def option_parser():
    parser = OptionParser(usage=USAGE)
    parser.add_option('--list-recipes', default=False, action='store_true',
            help=_('List builtin recipe names. You can create an e-book from '
                'a builtin recipe like this: ebook-convert "Recipe Name.recipe" '
                'output.epub'))
    return parser
 class ProgressBar(object):
    def __init__(self, log):
        self.log = log
    def __call__(self, frac, msg=''):
        if msg:
            percent = int(frac*100)
            self.log('%d%% %s'%(percent, msg))
 def create_option_parser(args, log):
    if '--version' in args:
        from calibre.constants import __appname__, __version__, __author__
        log(os.path.basename(args[0]), '('+__appname__, __version__+')')
        log('Created by:', __author__)
        raise SystemExit(0)
    if '--list-recipes' in args:
        from calibre.web.feeds.recipes.collection import get_builtin_recipe_titles
        log('Available recipes:')
        titles = sorted(get_builtin_recipe_titles())
        for title in titles:
            try:
                log('\t'+title)
            except:
                log('\t'+repr(title))
        log('%d recipes available'%len(titles))
        raise SystemExit(0)
    parser = option_parser()
    if len(args) < 3:
        print_help(parser, log)
        if any(x in args for x in ('-h', '--help')):
            raise SystemExit(0)
        else:
            raise SystemExit(1)
    input, output = check_command_line_options(parser, args, log)
    from calibre.ebooks.conversion.plumber import Plumber
    reporter = ProgressBar(log)
    if patheq(input, output):
        raise ValueError('Input file is the same as the output file')
    plumber = Plumber(input, output, log, reporter)
    add_input_output_options(parser, plumber)
    add_pipeline_options(parser, plumber)
    return parser, plumber
 def abspath(x):
    if x.startswith('http:') or x.startswith('https:'):
        return x
    return os.path.abspath(os.path.expanduser(x))
 def escape_sr_pattern(exp):
    return exp.replace('\n', '\ue123')
 def read_sr_patterns(path, log=None):
    import json, re
    pats = []
    with open(path, 'rb') as f:
        lines = f.read().decode('utf-8').splitlines()
    pat = None
    for line in lines:
        if pat is None:
            if not line.strip():
                continue
            line = line.replace('\ue123', '\n')
            try:
                re.compile(line)
            except:
                msg = 'Invalid regular expression: %r from file: %r'%(
                        line, path)
                if log is not None:
                    log.error(msg)
                    raise SystemExit(1)
                else:
                    raise ValueError(msg)
            pat = line
        else:
            pats.append((pat, line))
            pat = None
    return json.dumps(pats)
 def main(args=sys.argv):
    log = Log()
    parser, plumber = create_option_parser(args, log)
    opts, leftover_args = parser.parse_args(args)
    if len(leftover_args) > 3:
        log.error('Extra arguments not understood:', u', '.join(leftover_args[3:]))
        return 1
    for x in ('read_metadata_from_opf', 'cover'):
        if getattr(opts, x, None) is not None:
            setattr(opts, x, abspath(getattr(opts, x)))
    if opts.search_replace:
        opts.search_replace = read_sr_patterns(opts.search_replace, log)
    if opts.transform_css_rules:
        from calibre.ebooks.css_transform_rules import import_rules, validate_rule
        with open(opts.transform_css_rules, 'rb') as tcr:
            opts.transform_css_rules = rules = list(import_rules(tcr.read()))
            for rule in rules:
                title, msg = validate_rule(rule)
                if title and msg:
                    log.error('Failed to parse CSS transform rules')
                    log.error(title)
                    log.error(msg)
                    return 1
    recommendations = [(n.dest, getattr(opts, n.dest),
                        OptionRecommendation.HIGH)
                                        for n in parser.options_iter()
                                        if n.dest]
    plumber.merge_ui_recommendations(recommendations)
    try:
        plumber.run()
    except ConversionUserFeedBack as e:
        ll = {'info': log.info, 'warn': log.warn,
                'error':log.error}.get(e.level, log.info)
        ll(e.title)
        if e.det_msg:
            log.debug(e.det_msg)
        ll(e.msg)
        raise SystemExit(1)
    log(_('Output saved to'), ' ', plumber.output)
    return 0
 def manual_index_strings():
    return _('''\
 The options and default values for the options change depending on both the
 input and output formats, so you should always check with::
    %s
 Below are the options that are common to all conversion, followed by the
 options specific to every input and output format.''')
 if __name__ == '__main__':
    sys.exit(main())
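The search/replace file consumed by read_sr_patterns alternates pattern and replacement lines: blank lines are skipped only while waiting for a pattern, and the private-use character U+E123 stands in for a newline inside a pattern (see escape_sr_pattern). A minimal sketch of that parsing on an in-memory string, assuming the hypothetical name parse_sr_text and letting re.error propagate instead of logging:

```python
import json
import re


def parse_sr_text(text):
    """Sketch of the line pairing done by read_sr_patterns.

    Hypothetical helper for illustration; it works on a str rather than
    a file path and raises re.error on an invalid pattern.
    """
    pats = []
    pat = None
    for line in text.splitlines():
        if pat is None:
            if not line.strip():
                continue  # blank lines between pairs are ignored
            line = line.replace('\ue123', '\n')  # undo escape_sr_pattern
            re.compile(line)  # validate the pattern early
            pat = line
        else:
            # The line after a pattern is its replacement, even if empty
            pats.append((pat, line))
            pat = None
    return json.dumps(pats)
```

Note that an empty line directly after a pattern is taken as an empty replacement, so a pattern can be used to delete its matches.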
--- a/ebook_converter/ebooks/conversion/plugins/__init__.py
+++ b/ebook_converter/ebooks/conversion/plugins/__init__.py
@@ -0,0 +1,10 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2012, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
--- a/ebook_converter/ebooks/conversion/plugins/azw4_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/azw4_input.py
@@ -0,0 +1,29 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2011, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 from calibre.customize.conversion import InputFormatPlugin
 from polyglot.builtins import getcwd
 class AZW4Input(InputFormatPlugin):
    name        = 'AZW4 Input'
    author      = 'John Schember'
    description = 'Convert AZW4 to HTML'
    file_types  = {'azw4'}
    commit_name = 'azw4_input'
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.pdb.header import PdbHeaderReader
        from calibre.ebooks.azw4.reader import Reader
        header = PdbHeaderReader(stream)
        reader = Reader(header, stream, log, options)
        opf = reader.extract_content(getcwd())
        return opf
--- a/ebook_converter/ebooks/conversion/plugins/chm_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/chm_input.py
@@ -0,0 +1,202 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 ''' CHM File decoding support '''
 __license__ = 'GPL v3'
 __copyright__  = '2008, Kovid Goyal <kovid at kovidgoyal.net>,' \
                 ' and Alex Bramley <a.bramley at gmail.com>.'
 import os
 from calibre.customize.conversion import InputFormatPlugin
 from calibre.ptempfile import TemporaryDirectory
 from calibre.constants import filesystem_encoding
 from polyglot.builtins import unicode_type, as_bytes
 class CHMInput(InputFormatPlugin):
    name        = 'CHM Input'
    author      = 'Kovid Goyal and Alex Bramley'
    description = 'Convert CHM files to OEB'
    file_types  = {'chm'}
    commit_name = 'chm_input'
    def _chmtohtml(self, output_dir, chm_path, no_images, log, debug_dump=False):
        from calibre.ebooks.chm.reader import CHMReader
        log.debug('Opening CHM file')
        rdr = CHMReader(chm_path, log, input_encoding=self.opts.input_encoding)
        log.debug('Extracting CHM to %s' % output_dir)
        rdr.extract_content(output_dir, debug_dump=debug_dump)
        self._chm_reader = rdr
        return rdr.hhc_path
    def convert(self, stream, options, file_ext, log, accelerators):
        from calibre.ebooks.chm.metadata import get_metadata_from_reader
        from calibre.customize.ui import plugin_for_input_format
        self.opts = options
        log.debug('Processing CHM...')
        with TemporaryDirectory('_chm2oeb') as tdir:
            if not isinstance(tdir, unicode_type):
                tdir = tdir.decode(filesystem_encoding)
            html_input = plugin_for_input_format('html')
            for opt in html_input.options:
                setattr(options, opt.option.name, opt.recommended_value)
            no_images = False  # options.no_images
            chm_name = stream.name
            # chm_data = stream.read()
            # closing stream so CHM can be opened by external library
            stream.close()
            log.debug('tdir=%s' % tdir)
            log.debug('stream.name=%s' % stream.name)
            debug_dump = False
            odi = options.debug_pipeline
            if odi:
                debug_dump = os.path.join(odi, 'input')
            mainname = self._chmtohtml(tdir, chm_name, no_images, log,
                    debug_dump=debug_dump)
            mainpath = os.path.join(tdir, mainname)
            try:
                metadata = get_metadata_from_reader(self._chm_reader)
            except Exception:
                log.exception('Failed to read metadata, using filename')
                from calibre.ebooks.metadata.book.base import Metadata
                metadata = Metadata(os.path.basename(chm_name))
            encoding = self._chm_reader.get_encoding() or options.input_encoding or 'cp1252'
            self._chm_reader.CloseCHM()
            # print((tdir, mainpath))
            # from calibre import ipython
            # ipython()
            options.debug_pipeline = None
            options.input_encoding = 'utf-8'
            uenc = encoding
            if os.path.abspath(mainpath) in self._chm_reader.re_encoded_files:
                uenc = 'utf-8'
            htmlpath, toc = self._create_html_root(mainpath, log, uenc)
            oeb = self._create_oebbook_html(htmlpath, tdir, options, log, metadata)
            options.debug_pipeline = odi
            if toc.count() > 1:
                oeb.toc = self.parse_html_toc(oeb.spine[0])
                oeb.manifest.remove(oeb.spine[0])
                oeb.auto_generated_toc = False
        return oeb
    def parse_html_toc(self, item):
        from calibre.ebooks.oeb.base import TOC, XPath
        dx = XPath('./h:div')
        ax = XPath('./h:a[1]')
        def do_node(parent, div):
            for child in dx(div):
                a = ax(child)[0]
                c = parent.add(a.text, a.attrib['href'])
                do_node(c, child)
        toc = TOC()
        root = XPath('//h:div[1]')(item.data)[0]
        do_node(toc, root)
        return toc
    def _create_oebbook_html(self, htmlpath, basedir, opts, log, mi):
        # use HTMLInput plugin to generate book
        from calibre.customize.builtins import HTMLInput
        opts.breadth_first = True
        htmlinput = HTMLInput(None)
        oeb = htmlinput.create_oebbook(htmlpath, basedir, opts, log, mi)
        return oeb
    def _create_html_root(self, hhcpath, log, encoding):
        from lxml import html
        from polyglot.urllib import unquote as _unquote
        from calibre.ebooks.oeb.base import urlquote
        from calibre.ebooks.chardet import xml_to_unicode
        hhcdata = self._read_file(hhcpath)
        hhcdata = hhcdata.decode(encoding)
        hhcdata = xml_to_unicode(hhcdata, verbose=True,
                            strip_encoding_pats=True, resolve_entities=True)[0]
        hhcroot = html.fromstring(hhcdata)
        toc = self._process_nodes(hhcroot)
        # print("=============================")
        # print("Printing hhcroot")
        # print(etree.tostring(hhcroot, pretty_print=True))
        # print("=============================")
        log.debug('Found %d section nodes' % toc.count())
        htmlpath = os.path.splitext(hhcpath)[0] + ".html"
        base = os.path.dirname(os.path.abspath(htmlpath))
        def unquote(x):
            if isinstance(x, unicode_type):
                x = x.encode('utf-8')
            return _unquote(x).decode('utf-8')
        def unquote_path(x):
            y = unquote(x)
            if (not os.path.exists(os.path.join(base, x)) and os.path.exists(os.path.join(base, y))):
                x = y
            return x
        def donode(item, parent, base, subpath):
            for child in item:
                title = child.title
                if not title:
                    continue
                raw = unquote_path(child.href or '')
                rsrcname = os.path.basename(raw)
                rsrcpath = os.path.join(subpath, rsrcname)
                if (not os.path.exists(os.path.join(base, rsrcpath)) and os.path.exists(os.path.join(base, raw))):
                    rsrcpath = raw
                if '%' not in rsrcpath:
                    rsrcpath = urlquote(rsrcpath)
                if not raw:
                    rsrcpath = ''
                c = DIV(A(title, href=rsrcpath))
                donode(child, c, base, subpath)
                parent.append(c)
        with open(htmlpath, 'wb') as f:
            if toc.count() > 1:
                from lxml.html.builder import HTML, BODY, DIV, A
                path0 = toc[0].href
                path0 = unquote_path(path0)
                subpath = os.path.dirname(path0)
                base = os.path.dirname(f.name)
                root = DIV()
                donode(toc, root, base, subpath)
                raw = html.tostring(HTML(BODY(root)), encoding='utf-8',
                                   pretty_print=True)
                f.write(raw)
            else:
                f.write(as_bytes(hhcdata))
        return htmlpath, toc
    def _read_file(self, name):
        with lopen(name, 'rb') as f:
            data = f.read()
        return data
    def add_node(self, node, toc, ancestor_map):
        from calibre.ebooks.chm.reader import match_string
        if match_string(node.attrib.get('type', ''), 'text/sitemap'):
            p = node.xpath('ancestor::ul[1]/ancestor::li[1]/object[1]')
            parent = p[0] if p else None
            toc = ancestor_map.get(parent, toc)
            title = href = ''
            for param in node.xpath('./param'):
                if match_string(param.attrib['name'], 'name'):
                    title = param.attrib['value']
                elif match_string(param.attrib['name'], 'local'):
                    href = param.attrib['value']
            child = toc.add(title or _('Unknown'), href)
            ancestor_map[node] = child
    def _process_nodes(self, root):
        from calibre.ebooks.oeb.base import TOC
        toc = TOC()
        ancestor_map = {}
        for node in root.xpath('//object'):
            self.add_node(node, toc, ancestor_map)
        return toc
--- a/ebook_converter/ebooks/conversion/plugins/comic_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/comic_input.py
@@ -0,0 +1,310 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
 __docformat__ = 'restructuredtext en'
 '''
 Based on ideas from comiclrf created by FangornUK.
 '''
 import shutil, textwrap, codecs, os
 from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
 from calibre import CurrentDir
 from calibre.ptempfile import PersistentTemporaryDirectory
 from polyglot.builtins import getcwd, map
 class ComicInput(InputFormatPlugin):
    name        = 'Comic Input'
    author      = 'Kovid Goyal'
    description = 'Optimize comic files (.cbz, .cbr, .cbc) for viewing on portable devices'
    file_types  = {'cbz', 'cbr', 'cbc'}
    is_image_collection = True
    commit_name = 'comic_input'
    core_usage = -1
    options = {
        OptionRecommendation(name='colors', recommended_value=0,
            help=_('Reduce the number of colors used in the image. This works only'
                   ' if you choose the PNG output format. It is useful to reduce file sizes.'
                   ' Set to zero to turn off. Maximum value is 256. It is off by default.')),
        OptionRecommendation(name='dont_normalize', recommended_value=False,
            help=_('Disable normalize (improve contrast) color range '
            'for pictures. Default: False')),
        OptionRecommendation(name='keep_aspect_ratio', recommended_value=False,
            help=_('Maintain picture aspect ratio. Default is to fill the screen.')),
        OptionRecommendation(name='dont_sharpen', recommended_value=False,
            help=_('Disable sharpening.')),
        OptionRecommendation(name='disable_trim', recommended_value=False,
            help=_('Disable trimming of comic pages. For some comics, '
                     'trimming might remove content as well as borders.')),
        OptionRecommendation(name='landscape', recommended_value=False,
            help=_("Don't split landscape images into two portrait images")),
        OptionRecommendation(name='wide', recommended_value=False,
            help=_("Keep aspect ratio and scale image using screen height as "
            "image width for viewing in landscape mode.")),
        OptionRecommendation(name='right2left', recommended_value=False,
              help=_('Used for right-to-left publications like manga. '
              'Causes landscape pages to be split into portrait pages '
              'from right to left.')),
        OptionRecommendation(name='despeckle', recommended_value=False,
              help=_('Enable Despeckle. Reduces speckle noise. '
              'May greatly increase processing time.')),
        OptionRecommendation(name='no_sort', recommended_value=False,
              help=_("Don't sort the files found in the comic "
              "alphabetically by name. Instead use the order they were "
              "added to the comic.")),
        OptionRecommendation(name='output_format', choices=['png', 'jpg'],
            recommended_value='png', help=_('The format that images in the created e-book '
                'are converted to. You can experiment to see which format gives '
                'you optimal size and look on your device.')),
        OptionRecommendation(name='no_process', recommended_value=False,
              help=_("Apply no processing to the image")),
        OptionRecommendation(name='dont_grayscale', recommended_value=False,
            help=_('Do not convert the image to grayscale (black and white)')),
        OptionRecommendation(name='comic_image_size', recommended_value=None,
            help=_('Specify the image size as widthxheight pixels. Normally,'
                ' an image size is automatically calculated from the output '
                'profile, this option overrides it.')),
        OptionRecommendation(name='dont_add_comic_pages_to_toc', recommended_value=False,
            help=_('When converting a CBC do not add links to each page to'
                ' the TOC. Note this only applies if the TOC has more than one'
                ' section')),
        }
    recommendations = {
        ('margin_left', 0, OptionRecommendation.HIGH),
        ('margin_top',  0, OptionRecommendation.HIGH),
        ('margin_right', 0, OptionRecommendation.HIGH),
        ('margin_bottom', 0, OptionRecommendation.HIGH),
        ('insert_blank_line', False, OptionRecommendation.HIGH),
        ('remove_paragraph_spacing',  False, OptionRecommendation.HIGH),
        ('change_justification', 'left', OptionRecommendation.HIGH),
        ('dont_split_on_pagebreaks', True, OptionRecommendation.HIGH),
        ('chapter', None, OptionRecommendation.HIGH),
        ('use_auto_toc', False, OptionRecommendation.HIGH),
        ('page_breaks_before', None, OptionRecommendation.HIGH),
        ('disable_font_rescaling', True, OptionRecommendation.HIGH),
        ('linearize_tables', False, OptionRecommendation.HIGH),
        }
    def get_comics_from_collection(self, stream):
        from calibre.libunzip import extract as zipextract
        tdir = PersistentTemporaryDirectory('_comic_collection')
        zipextract(stream, tdir)
        comics = []
        with CurrentDir(tdir):
            if not os.path.exists('comics.txt'):
                raise ValueError((
                    '%s is not a valid comic collection:'
                    ' no comics.txt was found in the file')
                        %stream.name)
            with open('comics.txt', 'rb') as f:
                raw = f.read()
            if raw.startswith(codecs.BOM_UTF16_BE):
                raw = raw.decode('utf-16-be')[1:]
            elif raw.startswith(codecs.BOM_UTF16_LE):
                raw = raw.decode('utf-16-le')[1:]
            elif raw.startswith(codecs.BOM_UTF8):
                raw = raw.decode('utf-8')[1:]
            else:
                raw = raw.decode('utf-8')
            for line in raw.splitlines():
                line = line.strip()
                if not line:
                    continue
                fname, title = line.partition(':')[0], line.partition(':')[-1]
                fname = fname.replace('#', '_')
                fname = os.path.join(tdir, *fname.split('/'))
                if not title:
                    title = os.path.basename(fname).rpartition('.')[0]
                if os.access(fname, os.R_OK):
                    comics.append([title, fname])
        if not comics:
            raise ValueError('%s has no comics'%stream.name)
        return comics
    def get_pages(self, comic, tdir2):
        from calibre.ebooks.comic.input import (extract_comic,  process_pages,
                find_pages)
        tdir  = extract_comic(comic)
        new_pages = find_pages(tdir, sort_on_mtime=self.opts.no_sort,
                verbose=self.opts.verbose)
        thumbnail = None
        if not new_pages:
            raise ValueError('Could not find any pages in the comic: %s'
                    %comic)
        if self.opts.no_process:
            n2 = []
            for i, page in enumerate(new_pages):
                n2.append(os.path.join(tdir2, '{} - {}'.format(i, os.path.basename(page))))
                shutil.copyfile(page, n2[-1])
            new_pages = n2
        else:
            new_pages, failures = process_pages(new_pages, self.opts,
                    self.report_progress, tdir2)
            if failures:
                self.log.warning('Could not process the following pages '
                '(run with --verbose to see why):')
                for f in failures:
                    self.log.warning('\t', f)
            if not new_pages:
                raise ValueError('Could not find any valid pages in comic: %s'
                        % comic)
            thumbnail = os.path.join(tdir2,
                    'thumbnail.'+self.opts.output_format.lower())
            if not os.access(thumbnail, os.R_OK):
                thumbnail = None
        return new_pages
    def get_images(self):
        return self._images
    def convert(self, stream, opts, file_ext, log, accelerators):
        from calibre.ebooks.metadata import MetaInformation
        from calibre.ebooks.metadata.opf2 import OPFCreator
        from calibre.ebooks.metadata.toc import TOC
        self.opts, self.log = opts, log
        if file_ext == 'cbc':
            comics_ = self.get_comics_from_collection(stream)
        else:
            comics_ = [['Comic', os.path.abspath(stream.name)]]
        stream.close()
        comics = []
        for i, x in enumerate(comics_):
            title, fname = x
            cdir = 'comic_%d'%(i+1) if len(comics_) > 1 else '.'
            cdir = os.path.abspath(cdir)
            if not os.path.exists(cdir):
                os.makedirs(cdir)
            pages = self.get_pages(fname, cdir)
            if not pages:
                continue
            if self.for_viewer:
                comics.append((title, pages, [self.create_viewer_wrapper(pages)]))
            else:
                wrappers = self.create_wrappers(pages)
                comics.append((title, pages, wrappers))
        if not comics:
            raise ValueError('No comic pages found in %s'%stream.name)
        mi  = MetaInformation(os.path.basename(stream.name).rpartition('.')[0],
            [_('Unknown')])
        opf = OPFCreator(getcwd(), mi)
        entries = []
        def href(x):
            if len(comics) == 1:
                return os.path.basename(x)
            return '/'.join(x.split(os.sep)[-2:])
        cover_href = None
        for comic in comics:
            pages, wrappers = comic[1:]
            page_entries = [(x, None) for x in map(href, pages)]
            entries += [(w, None) for w in map(href, wrappers)] + page_entries
            if cover_href is None and page_entries:
                cover_href = page_entries[0][0]
        opf.create_manifest(entries)
        spine = []
        for comic in comics:
            spine.extend(map(href, comic[2]))
        self._images = []
        for comic in comics:
            self._images.extend(comic[1])
        opf.create_spine(spine)
        if self.for_viewer and cover_href:
            opf.guide.set_cover(cover_href)
        toc = TOC()
        if len(comics) == 1:
            wrappers = comics[0][2]
            for i, x in enumerate(wrappers):
                toc.add_item(href(x), None, _('Page')+' %d'%(i+1),
                        play_order=i)
        else:
            po = 0
            for comic in comics:
                po += 1
                wrappers = comic[2]
                stoc = toc.add_item(href(wrappers[0]),
                        None, comic[0], play_order=po)
                if not opts.dont_add_comic_pages_to_toc:
                    for i, x in enumerate(wrappers):
                        stoc.add_item(href(x), None,
                                _('Page')+' %d'%(i+1), play_order=po)
                        po += 1
        opf.set_toc(toc)
        with open('metadata.opf', 'wb') as m, open('toc.ncx', 'wb') as n:
            opf.render(m, n, 'toc.ncx')
        return os.path.abspath('metadata.opf')
    def create_wrappers(self, pages):
        from calibre.ebooks.oeb.base import XHTML_NS
        wrappers = []
        WRAPPER = textwrap.dedent('''\
        <html xmlns="%s">
            <head>
                <meta charset="utf-8"/>
                <title>Page #%d</title>
                <style type="text/css">
                    @page { margin:0pt; padding: 0pt}
                    body { margin: 0pt; padding: 0pt}
                    div { text-align: center }
                </style>
            </head>
            <body>
                <div>
                    <img src="%s" alt="comic page #%d" />
                </div>
            </body>
        </html>
        ''')
        dir = os.path.dirname(pages[0])
        for i, page in enumerate(pages):
            wrapper = WRAPPER%(XHTML_NS, i+1, os.path.basename(page), i+1)
            page = os.path.join(dir, 'page_%d.xhtml'%(i+1))
            with open(page, 'wb') as f:
                f.write(wrapper.encode('utf-8'))
            wrappers.append(page)
        return wrappers
    def create_viewer_wrapper(self, pages):
        from calibre.ebooks.oeb.base import XHTML_NS
        def page(src):
            return '<img src="{}"></img>'.format(os.path.basename(src))
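        # Resolve the output directory before `pages` is rebound to the joined
        # <img> markup; afterwards pages[0] would be the character '<'.
        base = os.path.dirname(pages[0])
        pages = '\n'.join(map(page, pages))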
        wrapper = '''
        <html xmlns="%s">
            <head>
                <meta charset="utf-8"/>
                <style type="text/css">
                html, body, img { height: 100vh; display: block; margin: 0; padding: 0; border-width: 0; }
                img {
                    width: 100%%; height: 100%%;
                    object-fit: contain;
                    margin-left: auto; margin-right: auto;
                    max-width: 100vw; max-height: 100vh;
                    top: 50vh; transform: translateY(-50%%);
                    position: relative;
                    page-break-after: always;
                }
                </style>
            </head>
            <body>
            %s
            </body>
        </html>
        ''' % (XHTML_NS, pages)
        path = os.path.join(base, 'wrapper.xhtml')
        with open(path, 'wb') as f:
            f.write(wrapper.encode('utf-8'))
        return path
--- a/ebook_converter/ebooks/conversion/plugins/djvu_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/djvu_input.py
@@ -0,0 +1,67 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2011, Anthon van der Neut <anthon@mnt.org>'
 __docformat__ = 'restructuredtext en'
 import os
 from io import BytesIO
 from calibre.customize.conversion import InputFormatPlugin
 from polyglot.builtins import getcwd
 class DJVUInput(InputFormatPlugin):
    name        = 'DJVU Input'
    author      = 'Anthon van der Neut'
    description = 'Convert OCR-ed DJVU files (.djvu) to HTML'
    file_types  = {'djvu', 'djv'}
    commit_name = 'djvu_input'
    def convert(self, stream, options, file_ext, log, accelerators):
        from calibre.ebooks.txt.processor import convert_basic
        stdout = BytesIO()
        from calibre.ebooks.djvu.djvu import DJVUFile
        x = DJVUFile(stream)
        x.get_text(stdout)
        raw_text = stdout.getvalue()
        if not raw_text:
            raise ValueError('The DJVU file contains no text, only images, probably page scans.'
                    ' calibre only supports conversion of DJVU files with actual text in them.')
        html = convert_basic(raw_text.replace(b"\n", b' ').replace(
            b'\037', b'\n\n'))
        # Run the HTMLized text through the html processing plugin.
        from calibre.customize.ui import plugin_for_input_format
        html_input = plugin_for_input_format('html')
        for opt in html_input.options:
            setattr(options, opt.option.name, opt.recommended_value)
        options.input_encoding = 'utf-8'
        base = getcwd()
        htmlfile = os.path.join(base, 'index.html')
        c = 0
        while os.path.exists(htmlfile):
            c += 1
            htmlfile = os.path.join(base, 'index%d.html'%c)
        with open(htmlfile, 'wb') as f:
            f.write(html.encode('utf-8'))
        odi = options.debug_pipeline
        options.debug_pipeline = None
        # Generate oeb from html conversion.
        with open(htmlfile, 'rb') as f:
            oeb = html_input.convert(f, options, 'html', log,
                {})
        options.debug_pipeline = odi
        os.remove(htmlfile)
        # Set metadata from file.
        from calibre.customize.ui import get_file_type_metadata
        from calibre.ebooks.oeb.transforms.metadata import meta_info_to_oeb_metadata
        mi = get_file_type_metadata(stream, file_ext)
        meta_info_to_oeb_metadata(mi, oeb.metadata, log)
        return oeb
--- a/ebook_converter/ebooks/conversion/plugins/docx_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/docx_input.py
@@ -0,0 +1,34 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
 class DOCXInput(InputFormatPlugin):
    name        = 'DOCX Input'
    author      = 'Kovid Goyal'
    description = _('Convert DOCX files (.docx and .docm) to HTML')
    file_types  = {'docx', 'docm'}
    commit_name = 'docx_input'
    options = {
        OptionRecommendation(name='docx_no_cover', recommended_value=False,
            help=_('Normally, if a large image is present at the start of the document that looks like a cover, '
                   'it will be removed from the document and used as the cover for created e-book. This option '
                   'turns off that behavior.')),
        OptionRecommendation(name='docx_no_pagebreaks_between_notes', recommended_value=False,
            help=_('Do not insert a page break after every endnote.')),
        OptionRecommendation(name='docx_inline_subsup', recommended_value=False,
            help=_('Render superscripts and subscripts so that they do not affect the line height.')),
    }
    recommendations = {('page_breaks_before', '/', OptionRecommendation.MED)}
    def convert(self, stream, options, file_ext, log, accelerators):
        from calibre.ebooks.docx.to_html import Convert
        return Convert(stream, detect_cover=not options.docx_no_cover, log=log, notes_nopb=options.docx_no_pagebreaks_between_notes,
                       nosupsub=options.docx_inline_subsup)()
--- a/ebook_converter/ebooks/conversion/plugins/docx_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/docx_output.py
@@ -0,0 +1,93 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
 PAGE_SIZES = ['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'b0', 'b1',
              'b2', 'b3', 'b4', 'b5', 'b6', 'legal', 'letter']
 class DOCXOutput(OutputFormatPlugin):
    name = 'DOCX Output'
    author = 'Kovid Goyal'
    file_type = 'docx'
    commit_name = 'docx_output'
    ui_data = {'page_sizes': PAGE_SIZES}
    options = {
        OptionRecommendation(name='docx_page_size', recommended_value='letter',
            level=OptionRecommendation.LOW, choices=PAGE_SIZES,
            help=_('The size of the page. Default is letter. Choices '
            'are %s') % PAGE_SIZES),
        OptionRecommendation(name='docx_custom_page_size', recommended_value=None,
            help=_('Custom size of the document. Use the form widthxheight '
            'EG. `123x321` to specify the width and height (in pts). '
            'This overrides any specified page-size.')),
        OptionRecommendation(name='docx_no_cover', recommended_value=False,
            help=_('Do not insert the book cover as an image at the start of the document.'
                   ' If you use this option, the book cover will be discarded.')),
        OptionRecommendation(name='preserve_cover_aspect_ratio', recommended_value=False,
            help=_('Preserve the aspect ratio of the cover image instead of stretching'
                   ' it out to cover the entire page.')),
        OptionRecommendation(name='docx_no_toc', recommended_value=False,
            help=_('Do not insert the table of contents as a page at the start of the document.')),
        OptionRecommendation(name='extract_to',
            help=_('Extract the contents of the generated %s file to the '
                'specified directory. The contents of the directory are first '
                'deleted, so be careful.') % 'DOCX'),
        OptionRecommendation(name='docx_page_margin_left', recommended_value=72.0,
            level=OptionRecommendation.LOW,
            help=_('The size of the left page margin, in pts. Default is 72pt.'
                   ' Overrides the common left page margin setting.')
        ),
        OptionRecommendation(name='docx_page_margin_top', recommended_value=72.0,
            level=OptionRecommendation.LOW,
            help=_('The size of the top page margin, in pts. Default is 72pt.'
                   ' Overrides the common top page margin setting, unless set to zero.')
        ),
        OptionRecommendation(name='docx_page_margin_right', recommended_value=72.0,
            level=OptionRecommendation.LOW,
            help=_('The size of the right page margin, in pts. Default is 72pt.'
                   ' Overrides the common right page margin setting, unless set to zero.')
        ),
        OptionRecommendation(name='docx_page_margin_bottom', recommended_value=72.0,
            level=OptionRecommendation.LOW,
            help=_('The size of the bottom page margin, in pts. Default is 72pt.'
                   ' Overrides the common bottom page margin setting, unless set to zero.')
        ),
    }
    def convert_metadata(self, oeb):
        from lxml import etree
        from calibre.ebooks.oeb.base import OPF, OPF2_NS
        from calibre.ebooks.metadata.opf2 import OPF as ReadOPF
        from io import BytesIO
        package = etree.Element(OPF('package'), attrib={'version': '2.0'}, nsmap={None: OPF2_NS})
        oeb.metadata.to_opf2(package)
        self.mi = ReadOPF(BytesIO(etree.tostring(package, encoding='utf-8')), populate_spine=False, try_to_guess_cover=False).to_book_metadata()
    def convert(self, oeb, output_path, input_plugin, opts, log):
        from calibre.ebooks.docx.writer.container import DOCX
        from calibre.ebooks.docx.writer.from_html import Convert
        docx = DOCX(opts, log)
        self.convert_metadata(oeb)
        Convert(oeb, docx, self.mi, not opts.docx_no_cover, not opts.docx_no_toc)()
        docx.write(output_path, self.mi)
        if opts.extract_to:
            from calibre.ebooks.docx.dump import do_dump
            do_dump(output_path, opts.extract_to)
--- a/ebook_converter/ebooks/conversion/plugins/epub_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/epub_input.py
@@ -0,0 +1,438 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import os, re, posixpath
 from itertools import cycle
 from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
 from polyglot.builtins import getcwd
 ADOBE_OBFUSCATION =  'http://ns.adobe.com/pdf/enc#RC'
 IDPF_OBFUSCATION = 'http://www.idpf.org/2008/embedding'
 def decrypt_font_data(key, data, algorithm):
    is_adobe = algorithm == ADOBE_OBFUSCATION
    crypt_len = 1024 if is_adobe else 1040
    crypt = bytearray(data[:crypt_len])
    key = cycle(iter(bytearray(key)))
    decrypt = bytes(bytearray(x^next(key) for x in crypt))
    return decrypt + data[crypt_len:]
 def decrypt_font(key, path, algorithm):
    with lopen(path, 'r+b') as f:
        data = decrypt_font_data(key, f.read(), algorithm)
        f.seek(0)
        f.truncate()
        f.write(data)
 class EPUBInput(InputFormatPlugin):
    name        = 'EPUB Input'
    author      = 'Kovid Goyal'
    description = 'Convert EPUB files (.epub) to HTML'
    file_types  = {'epub'}
    output_encoding = None
    commit_name = 'epub_input'
    recommendations = {('page_breaks_before', '/', OptionRecommendation.MED)}
    def process_encryption(self, encfile, opf, log):
        from lxml import etree
        import uuid, hashlib
        idpf_key = opf.raw_unique_identifier
        if idpf_key:
            idpf_key = re.sub('[\u0020\u0009\u000d\u000a]', '', idpf_key)
            idpf_key = hashlib.sha1(idpf_key.encode('utf-8')).digest()
        key = None
        for item in opf.identifier_iter():
            scheme = None
            for xkey in item.attrib.keys():
                if xkey.endswith('scheme'):
                    scheme = item.get(xkey)
            if (scheme and scheme.lower() == 'uuid') or \
                    (item.text and item.text.startswith('urn:uuid:')):
                try:
                    key = item.text.rpartition(':')[-1]
                    key = uuid.UUID(key).bytes
                except Exception:
                    import traceback
                    traceback.print_exc()
                    key = None
        try:
            root = etree.parse(encfile)
            for em in root.xpath('descendant::*[contains(name(), "EncryptionMethod")]'):
                algorithm = em.get('Algorithm', '')
                if algorithm not in {ADOBE_OBFUSCATION, IDPF_OBFUSCATION}:
                    return False
                cr = em.getparent().xpath('descendant::*[contains(name(), "CipherReference")]')[0]
                uri = cr.get('URI')
                path = os.path.abspath(os.path.join(os.path.dirname(encfile), '..', *uri.split('/')))
                tkey = (key if algorithm == ADOBE_OBFUSCATION else idpf_key)
                if (tkey and os.path.exists(path)):
                    self._encrypted_font_uris.append(uri)
                    decrypt_font(tkey, path, algorithm)
            return True
        except Exception:
            import traceback
            traceback.print_exc()
        return False
    def set_guide_type(self, opf, gtype, href=None, title=''):
        # Set the specified guide entry
        for elem in list(opf.iterguide()):
            if elem.get('type', '').lower() == gtype:
                elem.getparent().remove(elem)
        if href is not None:
            t = opf.create_guide_item(gtype, title, href)
            for guide in opf.root.xpath('./*[local-name()="guide"]'):
                guide.append(t)
                return t
            guide = opf.create_guide_element()
            opf.root.append(guide)
            guide.append(t)
            return t
    def rationalize_cover3(self, opf, log):
        ''' If there is a reference to the cover/titlepage via manifest properties, convert to
        entries in the <guide> so that the rest of the pipeline picks it up. '''
        from calibre.ebooks.metadata.opf3 import items_with_property
        removed = guide_titlepage_href = guide_titlepage_id = None
        # Look for titlepages incorrectly marked in the <guide> as covers
        guide_cover, guide_elem = None, None
        for guide_elem in opf.iterguide():
            if guide_elem.get('type', '').lower() == 'cover':
                guide_cover = guide_elem.get('href', '').partition('#')[0]
                break
        if guide_cover:
            spine = list(opf.iterspine())
            if spine:
                idref = spine[0].get('idref', '')
                for x in opf.itermanifest():
                    if x.get('id') == idref and x.get('href') == guide_cover:
                        guide_titlepage_href = guide_cover
                        guide_titlepage_id = idref
                        break
        raster_cover_href = opf.epub3_raster_cover or opf.raster_cover
        if raster_cover_href:
            self.set_guide_type(opf, 'cover', raster_cover_href, 'Cover Image')
        titlepage_id = titlepage_href = None
        for item in items_with_property(opf.root, 'calibre:title-page'):
            tid, href = item.get('id'), item.get('href')
            if href and tid:
                titlepage_id, titlepage_href = tid, href.partition('#')[0]
                break
        if titlepage_href is None:
            titlepage_href, titlepage_id = guide_titlepage_href, guide_titlepage_id
        if titlepage_href is not None:
            self.set_guide_type(opf, 'titlepage', titlepage_href, 'Title Page')
            spine = list(opf.iterspine())
            if len(spine) > 1:
                for item in spine:
                    if item.get('idref') == titlepage_id:
                        log('Found HTML cover', titlepage_href)
                        if self.for_viewer:
                            item.attrib.pop('linear', None)
                        else:
                            item.getparent().remove(item)
                            removed = titlepage_href
                        return removed
    def rationalize_cover2(self, opf, log):
        ''' Ensure that the cover information in the guide is correct. That
        means, at most one entry with type="cover" that points to a raster
        cover and at most one entry with type="titlepage" that points to an
        HTML titlepage. '''
        from calibre.ebooks.oeb.base import OPF
        removed = None
        from lxml import etree
        guide_cover, guide_elem = None, None
        for guide_elem in opf.iterguide():
            if guide_elem.get('type', '').lower() == 'cover':
                guide_cover = guide_elem.get('href', '').partition('#')[0]
                break
        if not guide_cover:
            raster_cover = opf.raster_cover
            if raster_cover:
                if guide_elem is None:
                    g = opf.root.makeelement(OPF('guide'))
                    opf.root.append(g)
                else:
                    g = guide_elem.getparent()
                guide_cover = raster_cover
                guide_elem = g.makeelement(OPF('reference'), attrib={'href':raster_cover, 'type':'cover'})
                g.append(guide_elem)
            return
        spine = list(opf.iterspine())
        if not spine:
            return
        # Check if the cover specified in the guide is also
        # the first element in the spine
        idref = spine[0].get('idref', '')
        manifest = list(opf.itermanifest())
        if not manifest:
            return
        elem = [x for x in manifest if x.get('id', '') == idref]
        if not elem or elem[0].get('href', None) != guide_cover:
            return
        log('Found HTML cover', guide_cover)
        # Remove from spine as covers must be treated
        # specially
        if not self.for_viewer:
            if len(spine) == 1:
                log.warn('There is only a single spine item and it is marked as the cover. Removing cover marking.')
                for guide_elem in tuple(opf.iterguide()):
                    if guide_elem.get('type', '').lower() == 'cover':
                        guide_elem.getparent().remove(guide_elem)
                return
            else:
                spine[0].getparent().remove(spine[0])
                removed = guide_cover
        else:
            # Ensure the cover is displayed as the first item in the book. Some
            # EPUB files set linear='no' on it, which causes the cover to be
            # displayed at the end.
            spine[0].attrib.pop('linear', None)
            opf.spine[0].is_linear = True
        # Ensure that the guide has a cover entry pointing to a raster cover
        # and a titlepage entry pointing to the html titlepage. The titlepage
        # entry will be used by the epub output plugin, the raster cover entry
        # by other output plugins.
        # Search for a raster cover identified in the OPF
        raster_cover = opf.raster_cover
        # Set the cover guide entry
        if raster_cover is not None:
            guide_elem.set('href', raster_cover)
        else:
            # Render the titlepage to create a raster cover
            from calibre.ebooks import render_html_svg_workaround
            guide_elem.set('href', 'calibre_raster_cover.jpg')
            t = etree.SubElement(
                elem[0].getparent(), OPF('item'), href=guide_elem.get('href'), id='calibre_raster_cover')
            t.set('media-type', 'image/jpeg')
            if os.path.exists(guide_cover):
                renderer = render_html_svg_workaround(guide_cover, log)
                if renderer is not None:
                    with lopen('calibre_raster_cover.jpg', 'wb') as f:
                        f.write(renderer)
        # Set the titlepage guide entry
        self.set_guide_type(opf, 'titlepage', guide_cover, 'Title Page')
        return removed
    def find_opf(self):
        from calibre.utils.xml_parse import safe_xml_fromstring
        def attr(n, attr):
            for k, v in n.attrib.items():
                if k.endswith(attr):
                    return v
        try:
            with lopen('META-INF/container.xml', 'rb') as f:
                root = safe_xml_fromstring(f.read())
                for r in root.xpath('//*[local-name()="rootfile"]'):
                    if attr(r, 'media-type') != "application/oebps-package+xml":
                        continue
                    path = attr(r, 'full-path')
                    if not path:
                        continue
                    path = os.path.join(getcwd(), *path.split('/'))
                    if os.path.exists(path):
                        return path
        except Exception:
            import traceback
            traceback.print_exc()
    def convert(self, stream, options, file_ext, log, accelerators):
        from calibre.utils.zipfile import ZipFile
        from calibre import walk
        from calibre.ebooks import DRMError
        from calibre.ebooks.metadata.opf2 import OPF
        try:
            zf = ZipFile(stream)
            zf.extractall(getcwd())
        except Exception:
            log.exception('EPUB appears to be an invalid ZIP file, trying a'
                    ' more forgiving ZIP parser')
            from calibre.utils.localunzip import extractall
            stream.seek(0)
            extractall(stream)
        encfile = os.path.abspath(os.path.join('META-INF', 'encryption.xml'))
        opf = self.find_opf()
        if opf is None:
            for f in walk('.'):
                if f.lower().endswith('.opf') and '__MACOSX' not in f and \
                        not os.path.basename(f).startswith('.'):
                    opf = os.path.abspath(f)
                    break
        path = getattr(stream, 'name', 'stream')
        if opf is None:
            raise ValueError('%s is not a valid EPUB file (could not find opf)'%path)
        opf = os.path.relpath(opf, getcwd())
        parts = os.path.split(opf)
        opf = OPF(opf, os.path.dirname(os.path.abspath(opf)))
        self._encrypted_font_uris = []
        if os.path.exists(encfile):
            if not self.process_encryption(encfile, opf, log):
                raise DRMError(os.path.basename(path))
        self.encrypted_fonts = self._encrypted_font_uris
        if len(parts) > 1 and parts[0]:
            delta = '/'.join(parts[:-1])+'/'
            def normpath(x):
                return posixpath.normpath(delta + x)
            for elem in opf.itermanifest():
                elem.set('href', normpath(elem.get('href')))
            for elem in opf.iterguide():
                elem.set('href', normpath(elem.get('href')))
        f = self.rationalize_cover3 if opf.package_version >= 3.0 else self.rationalize_cover2
        self.removed_cover = f(opf, log)
        if self.removed_cover:
            self.removed_items_to_ignore = (self.removed_cover,)
        epub3_nav = opf.epub3_nav
        if epub3_nav is not None:
            self.convert_epub3_nav(epub3_nav, opf, log, options)
        for x in opf.itermanifest():
            if x.get('media-type', '') == 'application/x-dtbook+xml':
                raise ValueError(
                    'EPUB files with DTBook markup are not supported')
        not_for_spine = set()
        for y in opf.itermanifest():
            id_ = y.get('id', None)
            if id_:
                mt = y.get('media-type', None)
                if mt in {
                        'application/vnd.adobe-page-template+xml',
                        'application/vnd.adobe.page-template+xml',
                        'application/adobe-page-template+xml',
                        'application/adobe.page-template+xml',
                        'application/text'
                }:
                    not_for_spine.add(id_)
                ext = y.get('href', '').rpartition('.')[-1].lower()
                if mt == 'text/plain' and ext in {'otf', 'ttf'}:
                    # some epub authoring software sets font mime types to
                    # text/plain
                    not_for_spine.add(id_)
                    y.set('media-type', 'application/font')
        seen = set()
        for x in list(opf.iterspine()):
            ref = x.get('idref', None)
            if not ref or ref in not_for_spine or ref in seen:
                x.getparent().remove(x)
                continue
            seen.add(ref)
        if len(list(opf.iterspine())) == 0:
            raise ValueError('No valid entries in the spine of this EPUB')
        with lopen('content.opf', 'wb') as nopf:
            nopf.write(opf.render())
        return os.path.abspath('content.opf')
    def convert_epub3_nav(self, nav_path, opf, log, opts):
        from lxml import etree
        from calibre.ebooks.chardet import xml_to_unicode
        from calibre.ebooks.oeb.polish.parsing import parse
        from calibre.ebooks.oeb.base import EPUB_NS, XHTML, NCX_MIME, NCX, urlnormalize, urlunquote, serialize
        from calibre.ebooks.oeb.polish.toc import first_child
        from calibre.utils.xml_parse import safe_xml_fromstring
        from tempfile import NamedTemporaryFile
        with lopen(nav_path, 'rb') as f:
            raw = f.read()
        raw = xml_to_unicode(raw, strip_encoding_pats=True, assume_utf8=True)[0]
        root = parse(raw, log=log)
        ncx = safe_xml_fromstring('<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="eng"><navMap/></ncx>')
        navmap = ncx[0]
        et = '{%s}type' % EPUB_NS
        bn = os.path.basename(nav_path)
        def add_from_li(li, parent):
            href = text = None
            for x in li.iterchildren(XHTML('a'), XHTML('span')):
                text = etree.tostring(
                    x, method='text', encoding='unicode', with_tail=False).strip() or ' '.join(
                            x.xpath('descendant-or-self::*/@title')).strip()
                href = x.get('href')
                if href:
                    if href.startswith('#'):
                        href = bn + href
                break
            np = parent.makeelement(NCX('navPoint'))
            parent.append(np)
            np.append(np.makeelement(NCX('navLabel')))
            np[0].append(np.makeelement(NCX('text')))
            np[0][0].text = text
            if href:
                np.append(np.makeelement(NCX('content'), attrib={'src':href}))
            return np
        def process_nav_node(node, toc_parent):
            for li in node.iterchildren(XHTML('li')):
                child = add_from_li(li, toc_parent)
                ol = first_child(li, XHTML('ol'))
                if child is not None and ol is not None:
                    process_nav_node(ol, child)
        for nav in root.iterdescendants(XHTML('nav')):
            if nav.get(et) == 'toc':
                ol = first_child(nav, XHTML('ol'))
                if ol is not None:
                    process_nav_node(ol, navmap)
                    break
        else:
            return
        with NamedTemporaryFile(suffix='.ncx', dir=os.path.dirname(nav_path), delete=False) as f:
            f.write(etree.tostring(ncx, encoding='utf-8'))
        ncx_href = os.path.relpath(f.name, getcwd()).replace(os.sep, '/')
        ncx_id = opf.create_manifest_item(ncx_href, NCX_MIME, append=True).get('id')
        for spine in opf.root.xpath('//*[local-name()="spine"]'):
            spine.set('toc', ncx_id)
        opts.epub3_nav_href = urlnormalize(os.path.relpath(nav_path).replace(os.sep, '/'))
        opts.epub3_nav_parsed = root
        if getattr(self, 'removed_cover', None):
            changed = False
            base_path = os.path.dirname(nav_path)
            for elem in root.xpath('//*[@href]'):
                href, frag = elem.get('href').partition('#')[::2]
                link_path = os.path.relpath(os.path.join(base_path, urlunquote(href)), base_path)
                abs_href = urlnormalize(link_path)
                if abs_href == self.removed_cover:
                    changed = True
                    elem.set('data-calibre-removed-titlepage', '1')
            if changed:
                with lopen(nav_path, 'wb') as f:
                    f.write(serialize(root, 'application/xhtml+xml'))
    def postprocess_book(self, oeb, opts, log):
        rc = getattr(self, 'removed_cover', None)
        if rc:
            cover_toc_item = None
            for item in oeb.toc.iterdescendants():
                if item.href and item.href.partition('#')[0] == rc:
                    cover_toc_item = item
                    break
            spine = {x.href for x in oeb.spine}
            if cover_toc_item is not None and cover_toc_item.href.partition('#')[0] not in spine:
                oeb.toc.item_that_refers_to_cover = cover_toc_item
--- a/ebook_converter/ebooks/conversion/plugins/epub_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/epub_output.py
@@ -0,0 +1,548 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import os, shutil, re
 from calibre.customize.conversion import (OutputFormatPlugin,
        OptionRecommendation)
 from calibre.ptempfile import TemporaryDirectory
 from calibre import CurrentDir
 from polyglot.builtins import unicode_type, filter, map, zip, range, as_bytes
 block_level_tags = (
      'address',
      'body',
      'blockquote',
      'center',
      'dir',
      'div',
      'dl',
      'fieldset',
      'form',
      'h1',
      'h2',
      'h3',
      'h4',
      'h5',
      'h6',
      'hr',
      'isindex',
      'menu',
      'noframes',
      'noscript',
      'ol',
      'p',
      'pre',
      'table',
      'ul',
 )
 class EPUBOutput(OutputFormatPlugin):
    name = 'EPUB Output'
    author = 'Kovid Goyal'
    file_type = 'epub'
    commit_name = 'epub_output'
    ui_data = {'versions': ('2', '3')}
    options = {
        OptionRecommendation(name='extract_to',
            help=_('Extract the contents of the generated %s file to the '
                'specified directory. The contents of the directory are first '
                'deleted, so be careful.') % 'EPUB'),
        OptionRecommendation(name='dont_split_on_page_breaks',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Turn off splitting at page breaks. Normally, input '
                    'files are automatically split at every page break into '
                    'two files. This gives an output e-book that can be '
                    'parsed faster and with less resources. However, '
                    'splitting is slow and if your source file contains a '
                    'very large number of page breaks, you should turn off '
                    'splitting on page breaks.'
                )
        ),
        OptionRecommendation(name='flow_size', recommended_value=260,
            help=_('Split all HTML files larger than this size (in KB). '
                'This is necessary as most EPUB readers cannot handle large '
                'file sizes. The default of %defaultKB is the size required '
                'for Adobe Digital Editions. Set to 0 to disable size based splitting.')
        ),
        OptionRecommendation(name='no_default_epub_cover', recommended_value=False,
            help=_('Normally, if the input file has no cover and you don\'t'
            ' specify one, a default cover is generated with the title, '
            'authors, etc. This option disables the generation of this cover.')
        ),
        OptionRecommendation(name='no_svg_cover', recommended_value=False,
            help=_('Do not use SVG for the book cover. Use this option if '
                'your EPUB is going to be used on a device that does not '
                'support SVG, like the iPhone or the JetBook Lite. '
                'Without this option, such devices will display the cover '
                'as a blank page.')
        ),
        OptionRecommendation(name='preserve_cover_aspect_ratio',
            recommended_value=False, help=_(
            'When using an SVG cover, this option will cause the cover to scale '
            'to cover the available screen area, but still preserve its aspect ratio '
            '(ratio of width to height). That means there may be white borders '
            'at the sides or top and bottom of the image, but the image will '
            'never be distorted. Without this option the image may be slightly '
            'distorted, but there will be no borders.'
            )
        ),
        OptionRecommendation(name='epub_flatten', recommended_value=False,
            help=_('This option is needed only if you intend to use the EPUB'
                ' with FBReaderJ. It will flatten the file system inside the'
                ' EPUB, putting all files into the top level.')
        ),
        OptionRecommendation(name='epub_inline_toc', recommended_value=False,
            help=_('Insert an inline Table of Contents that will appear as part of the main book content.')
        ),
        OptionRecommendation(name='epub_toc_at_end', recommended_value=False,
            help=_('Put the inserted inline Table of Contents at the end of the book instead of the start.')
        ),
        OptionRecommendation(name='toc_title', recommended_value=None,
            help=_('Title for any generated in-line table of contents.')
        ),
        OptionRecommendation(name='epub_version', recommended_value='2', choices=ui_data['versions'],
            help=_('The version of the EPUB file to generate. EPUB 2 is the'
                ' most widely compatible, only use EPUB 3 if you know you'
                ' actually need it.')
        ),
        }
    recommendations = {('pretty_print', True, OptionRecommendation.HIGH)}
    def workaround_webkit_quirks(self):  # {{{
        from calibre.ebooks.oeb.base import XPath
        for x in self.oeb.spine:
            root = x.data
            body = XPath('//h:body')(root)
            if body:
                body = body[0]
            if not hasattr(body, 'xpath'):
                continue
            for pre in XPath('//h:pre')(body):
                if not pre.text and len(pre) == 0:
                    pre.tag = 'div'
    # }}}
    def upshift_markup(self):  # {{{
        'Upgrade markup to comply with XHTML 1.1 where possible'
        from calibre.ebooks.oeb.base import XPath, XML
        for x in self.oeb.spine:
            root = x.data
            if (not root.get(XML('lang'))) and (root.get('lang')):
                root.set(XML('lang'), root.get('lang'))
            body = XPath('//h:body')(root)
            if body:
                body = body[0]
            if not hasattr(body, 'xpath'):
                continue
            for u in XPath('//h:u')(root):
                u.tag = 'span'
            seen_ids, seen_names = set(), set()
            for elem in XPath('//*[@id or @name]')(root):
                eid, name = elem.get('id', None), elem.get('name', None)
                if eid:
                    if eid in seen_ids:
                        del elem.attrib['id']
                    else:
                        seen_ids.add(eid)
                if name:
                    if name in seen_names:
                        del elem.attrib['name']
                    else:
                        seen_names.add(name)
    # }}}
    def convert(self, oeb, output_path, input_plugin, opts, log):
        self.log, self.opts, self.oeb = log, opts, oeb
        if self.opts.epub_inline_toc:
            from calibre.ebooks.mobi.writer8.toc import TOCAdder
            opts.mobi_toc_at_start = not opts.epub_toc_at_end
            opts.mobi_passthrough = False
            opts.no_inline_toc = False
            TOCAdder(oeb, opts, replace_previous_inline_toc=True, ignore_existing_toc=True)
        if self.opts.epub_flatten:
            from calibre.ebooks.oeb.transforms.filenames import FlatFilenames
            FlatFilenames()(oeb, opts)
        else:
            from calibre.ebooks.oeb.transforms.filenames import UniqueFilenames
            UniqueFilenames()(oeb, opts)
        self.workaround_ade_quirks()
        self.workaround_webkit_quirks()
        self.upshift_markup()
        from calibre.ebooks.oeb.transforms.rescale import RescaleImages
        RescaleImages(check_colorspaces=True)(oeb, opts)
        from calibre.ebooks.oeb.transforms.split import Split
        split = Split(not self.opts.dont_split_on_page_breaks,
                max_flow_size=self.opts.flow_size*1024
                )
        split(self.oeb, self.opts)
        from calibre.ebooks.oeb.transforms.cover import CoverManager
        cm = CoverManager(
                no_default_cover=self.opts.no_default_epub_cover,
                no_svg_cover=self.opts.no_svg_cover,
                preserve_aspect_ratio=self.opts.preserve_cover_aspect_ratio)
        cm(self.oeb, self.opts, self.log)
        self.workaround_sony_quirks()
        if self.oeb.toc.count() == 0:
            self.log.warn('This EPUB file has no Table of Contents. '
                    'Creating a default TOC')
            first = next(iter(self.oeb.spine))
            self.oeb.toc.add(_('Start'), first.href)
        from calibre.ebooks.oeb.base import OPF
        identifiers = oeb.metadata['identifier']
        uuid = None
        for x in identifiers:
            if (x.get(OPF('scheme'), None) or '').lower() == 'uuid' or unicode_type(x).startswith('urn:uuid:'):
                uuid = unicode_type(x).split(':')[-1]
                break
        encrypted_fonts = getattr(input_plugin, 'encrypted_fonts', [])
        if uuid is None:
            self.log.warn('No UUID identifier found')
            from uuid import uuid4
            uuid = unicode_type(uuid4())
            oeb.metadata.add('identifier', uuid, scheme='uuid', id=uuid)
        if encrypted_fonts and not uuid.startswith('urn:uuid:'):
            # Apparently ADE requires this value to start with urn:uuid:
            # for some absurd reason, or it will throw a hissy fit and refuse
            # to use the obfuscated fonts.
            for x in identifiers:
                if unicode_type(x) == uuid:
                    x.content = 'urn:uuid:'+uuid
        with TemporaryDirectory('_epub_output') as tdir:
            from calibre.customize.ui import plugin_for_output_format
            metadata_xml = None
            extra_entries = []
            if self.is_periodical:
                if self.opts.output_profile.epub_periodical_format == 'sony':
                    from calibre.ebooks.epub.periodical import sony_metadata
                    metadata_xml, atom_xml = sony_metadata(oeb)
                    extra_entries = [('atom.xml', 'application/atom+xml', atom_xml)]
            oeb_output = plugin_for_output_format('oeb')
            oeb_output.convert(oeb, tdir, input_plugin, opts, log)
            opf = [x for x in os.listdir(tdir) if x.endswith('.opf')][0]
            self.condense_ncx([os.path.join(tdir, x) for x in os.listdir(tdir)
                    if x.endswith('.ncx')][0])
            if self.opts.epub_version == '3':
                self.upgrade_to_epub3(tdir, opf)
            encryption = None
            if encrypted_fonts:
                encryption = self.encrypt_fonts(encrypted_fonts, tdir, uuid)
            from calibre.ebooks.epub import initialize_container
            with initialize_container(output_path, os.path.basename(opf),
                    extra_entries=extra_entries) as epub:
                epub.add_dir(tdir)
                if encryption is not None:
                    epub.writestr('META-INF/encryption.xml', as_bytes(encryption))
                if metadata_xml is not None:
                    epub.writestr('META-INF/metadata.xml',
                            metadata_xml.encode('utf-8'))
            if opts.extract_to is not None:
                from calibre.utils.zipfile import ZipFile
                if os.path.exists(opts.extract_to):
                    if os.path.isdir(opts.extract_to):
                        shutil.rmtree(opts.extract_to)
                    else:
                        os.remove(opts.extract_to)
                os.mkdir(opts.extract_to)
                with ZipFile(output_path) as zf:
                    zf.extractall(path=opts.extract_to)
                self.log.info('EPUB extracted to', opts.extract_to)
    def upgrade_to_epub3(self, tdir, opf):
        self.log.info('Upgrading to EPUB 3...')
        from calibre.ebooks.epub import simple_container_xml
        from calibre.ebooks.oeb.polish.cover import fix_conversion_titlepage_links_in_nav
        try:
            os.mkdir(os.path.join(tdir, 'META-INF'))
        except EnvironmentError:
            pass
        with open(os.path.join(tdir, 'META-INF', 'container.xml'), 'wb') as f:
            f.write(simple_container_xml(os.path.basename(opf)).encode('utf-8'))
        from calibre.ebooks.oeb.polish.container import EpubContainer
        container = EpubContainer(tdir, self.log)
        from calibre.ebooks.oeb.polish.upgrade import epub_2_to_3
        existing_nav = getattr(self.opts, 'epub3_nav_parsed', None)
        nav_href = getattr(self.opts, 'epub3_nav_href', None)
        previous_nav = (nav_href, existing_nav) if existing_nav and nav_href else None
        epub_2_to_3(container, self.log.info, previous_nav=previous_nav)
        fix_conversion_titlepage_links_in_nav(container)
        container.commit()
        os.remove(f.name)
        try:
            os.rmdir(os.path.join(tdir, 'META-INF'))
        except EnvironmentError:
            pass
    def encrypt_fonts(self, uris, tdir, uuid):  # {{{
        from polyglot.binary import from_hex_bytes
        key = re.sub(r'[^a-fA-F0-9]', '', uuid)
        if len(key) < 16:
            raise ValueError('UUID identifier %r is invalid'%uuid)
        key = bytearray(from_hex_bytes((key + key)[:32]))
        paths = []
        with CurrentDir(tdir):
            paths = [os.path.join(*x.split('/')) for x in uris]
            uris = dict(zip(uris, paths))
            fonts = []
            for uri in list(uris.keys()):
                path = uris[uri]
                if not os.path.exists(path):
                    uris.pop(uri)
                    continue
                self.log.debug('Encrypting font:', uri)
                with lopen(path, 'r+b') as f:
                    data = f.read(1024)
                    if len(data) >= 1024:
                        data = bytearray(data)
                        f.seek(0)
                        f.write(bytes(bytearray(data[i] ^ key[i%16] for i in range(1024))))
                    else:
                        self.log.warn('Font', path, 'is invalid, ignoring')
                if not isinstance(uri, unicode_type):
                    uri = uri.decode('utf-8')
                fonts.append('''
                <enc:EncryptedData>
                    <enc:EncryptionMethod Algorithm="http://ns.adobe.com/pdf/enc#RC"/>
                    <enc:CipherData>
                    <enc:CipherReference URI="%s"/>
                    </enc:CipherData>
                </enc:EncryptedData>
                '''%(uri.replace('&', '&amp;').replace('"', '&quot;')))
            if fonts:
                ans = '''<encryption
                    xmlns="urn:oasis:names:tc:opendocument:xmlns:container"
                    xmlns:enc="http://www.w3.org/2001/04/xmlenc#"
                    xmlns:deenc="http://ns.adobe.com/digitaleditions/enc">
                    '''
                ans += '\n'.join(fonts)
                ans += '\n</encryption>'
                return ans
    # }}}
    def condense_ncx(self, ncx_path):  # {{{
        from lxml import etree
        if not self.opts.pretty_print:
            tree = etree.parse(ncx_path)
            for tag in tree.getroot().iter(tag=etree.Element):
                if tag.text:
                    tag.text = tag.text.strip()
                if tag.tail:
                    tag.tail = tag.tail.strip()
            compressed = etree.tostring(tree.getroot(), encoding='utf-8')
            with open(ncx_path, 'wb') as f:
                f.write(compressed)
    # }}}
    def workaround_ade_quirks(self):  # {{{
        '''
        Perform various markup transforms to get the output to render correctly
        in the quirky ADE.
        '''
        from calibre.ebooks.oeb.base import XPath, XHTML, barename, urlunquote
        stylesheet = self.oeb.manifest.main_stylesheet
        # ADE cries big wet tears when it encounters an invalid fragment
        # identifier in the NCX toc.
        frag_pat = re.compile(r'[-A-Za-z0-9_:.]+$')
        for node in self.oeb.toc.iter():
            href = getattr(node, 'href', None)
            if hasattr(href, 'partition'):
                base, _, frag = href.partition('#')
                frag = urlunquote(frag)
                if frag and frag_pat.match(frag) is None:
                    self.log.warn(
                            'Removing fragment identifier %r from TOC as Adobe Digital Editions cannot handle it'%frag)
                    node.href = base
        for x in self.oeb.spine:
            root = x.data
            body = XPath('//h:body')(root)
            if body:
                body = body[0]
            if hasattr(body, 'xpath'):
                # Remove <img> tags whose src is empty, '#' or an http: URL
                bad = []
                for img in XPath('//h:img')(body):
                    src = img.get('src', '').strip()
                    if src in ('', '#') or src.startswith('http:'):
                        bad.append(img)
                for img in bad:
                    img.getparent().remove(img)
                # Add id attribute to <a> tags that have name
                for a in XPath('//h:a[@name]')(body):
                    if not a.get('id', False):
                        a.set('id', a.get('name'))
                    # The delightful epubcheck has started complaining about <a> tags that
                    # have name attributes.
                    a.attrib.pop('name')
                # Replace <br> that are children of <body> as ADE doesn't handle them
                for br in XPath('./h:br')(body):
                    if br.getparent() is None:
                        continue
                    try:
                        prior = next(br.itersiblings(preceding=True))
                        priortag = barename(prior.tag)
                        priortext = prior.tail
                    except Exception:
                        priortag = 'body'
                        priortext = body.text
                    if priortext:
                        priortext = priortext.strip()
                    br.tag = XHTML('p')
                    br.text = '\u00a0'
                    style = br.get('style', '').split(';')
                    style = list(filter(None, map(lambda x: x.strip(), style)))
                    style.append('margin:0pt; border:0pt')
                    # If the prior tag is a block (including a <br> we replaced)
                    # then this <br> replacement should have a 1-line height.
                    # Otherwise it should have no height.
                    if not priortext and priortag in block_level_tags:
                        style.append('height:1em')
                    else:
                        style.append('height:0pt')
                    br.set('style', '; '.join(style))
            for tag in XPath('//h:embed')(root):
                tag.getparent().remove(tag)
            for tag in XPath('//h:object')(root):
                if tag.get('type', '').lower().strip() in {'image/svg+xml', 'application/svg+xml'}:
                    continue
                tag.getparent().remove(tag)
            for tag in XPath('//h:title|//h:style')(root):
                if not tag.text:
                    tag.getparent().remove(tag)
            for tag in XPath('//h:script')(root):
                if (not tag.text and not tag.get('src', False) and tag.get('type', None) != 'text/x-mathjax-config'):
                    tag.getparent().remove(tag)
            for tag in XPath('//h:body/descendant::h:script')(root):
                tag.getparent().remove(tag)
            formchildren = XPath('./h:input|./h:button|./h:textarea|'
                    './h:label|./h:fieldset|./h:legend')
            for tag in XPath('//h:form')(root):
                if formchildren(tag):
                    tag.getparent().remove(tag)
                else:
                    # Not a real form
                    tag.tag = XHTML('div')
            for tag in XPath('//h:center')(root):
                tag.tag = XHTML('div')
                tag.set('style', 'text-align:center')
            # ADE can't handle &amp; in an img url
            for tag in XPath('//h:img[@src]')(root):
                tag.set('src', tag.get('src', '').replace('&', ''))
            # ADE whimpers in fright when it encounters a <td> outside a
            # <table>
            in_table = XPath('ancestor::h:table')
            for tag in XPath('//h:td|//h:tr|//h:th')(root):
                if not in_table(tag):
                    tag.tag = XHTML('div')
            # ADE fails to render non-breaking hyphens/soft hyphens/zero-width spaces
            special_chars = re.compile('[\u200b\u00ad]')
            for elem in root.iterdescendants('*'):
                if elem.text:
                    elem.text = special_chars.sub('', elem.text)
                    elem.text = elem.text.replace('\u2011', '-')
                if elem.tail:
                    elem.tail = special_chars.sub('', elem.tail)
                    elem.tail = elem.tail.replace('\u2011', '-')
            if stylesheet is not None:
                # ADE doesn't render lists correctly if they have left margins
                from css_parser.css import CSSRule
                for lb in XPath('//h:ul[@class]|//h:ol[@class]')(root):
                    sel = '.'+lb.get('class')
                    for rule in stylesheet.data.cssRules.rulesOfType(CSSRule.STYLE_RULE):
                        if sel == rule.selectorList.selectorText:
                            rule.style.removeProperty('margin-left')
                            # padding-left breaks rendering in webkit and gecko
                            rule.style.removeProperty('padding-left')
                # Change white-space: pre to pre-wrap to accommodate readers that
                # cannot scroll horizontally
                for rule in stylesheet.data.cssRules.rulesOfType(CSSRule.STYLE_RULE):
                    style = rule.style
                    ws = style.getPropertyValue('white-space')
                    if ws == 'pre':
                        style.setProperty('white-space', 'pre-wrap')
    # }}}
    def workaround_sony_quirks(self):  # {{{
        '''
        Perform toc link transforms to alleviate slow loading.
        '''
        from calibre.ebooks.oeb.base import urldefrag, XPath
        from calibre.ebooks.oeb.polish.toc import item_at_top
        def frag_is_at_top(root, frag):
            elem = XPath('//*[@id="%s" or @name="%s"]'%(frag, frag))(root)
            if elem:
                elem = elem[0]
            else:
                return False
            return item_at_top(elem)
        def simplify_toc_entry(toc):
            if toc.href:
                href, frag = urldefrag(toc.href)
                if frag:
                    for x in self.oeb.spine:
                        if x.href == href:
                            if frag_is_at_top(x.data, frag):
                                self.log.debug('Removing anchor from TOC href:',
                                        href+'#'+frag)
                                toc.href = href
                            break
            for x in toc:
                simplify_toc_entry(x)
        if self.oeb.toc:
            simplify_toc_entry(self.oeb.toc)
    # }}}
--- a/ebook_converter/ebooks/conversion/plugins/fb2_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/fb2_input.py
@@ -0,0 +1,179 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Anatoly Shipitsin <norguhtar at gmail.com>'
 """
 Convert .fb2 and .fbz files to HTML
 """
 import os, re
 from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
 from calibre import guess_type
 from polyglot.builtins import iteritems, getcwd
 FB2NS  = 'http://www.gribuser.ru/xml/fictionbook/2.0'
 FB21NS = 'http://www.gribuser.ru/xml/fictionbook/2.1'
 class FB2Input(InputFormatPlugin):
    name        = 'FB2 Input'
    author      = 'Anatoly Shipitsin'
    description = 'Convert FB2 and FBZ files to HTML'
    file_types  = {'fb2', 'fbz'}
    commit_name = 'fb2_input'
    recommendations = {
        ('level1_toc', '//h:h1', OptionRecommendation.MED),
        ('level2_toc', '//h:h2', OptionRecommendation.MED),
        ('level3_toc', '//h:h3', OptionRecommendation.MED),
        }
    options = {
    OptionRecommendation(name='no_inline_fb2_toc',
        recommended_value=False, level=OptionRecommendation.LOW,
        help=_('Do not insert a Table of Contents at the beginning of the book.'
                )
        )}
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from lxml import etree
        from calibre.utils.xml_parse import safe_xml_fromstring
        from calibre.ebooks.metadata.fb2 import ensure_namespace, get_fb2_data
        from calibre.ebooks.metadata.opf2 import OPFCreator
        from calibre.ebooks.metadata.meta import get_metadata
        from calibre.ebooks.oeb.base import XLINK_NS, XHTML_NS
        from calibre.ebooks.chardet import xml_to_unicode
        self.log = log
        log.debug('Parsing XML...')
        raw = get_fb2_data(stream)[0]
        raw = raw.replace(b'\0', b'')
        raw = xml_to_unicode(raw, strip_encoding_pats=True,
            assume_utf8=True, resolve_entities=True)[0]
        try:
            doc = safe_xml_fromstring(raw)
        except etree.XMLSyntaxError:
            doc = safe_xml_fromstring(raw.replace('& ', '&amp; '))
        if doc is None:
            raise ValueError('The FB2 file is not valid XML')
        doc = ensure_namespace(doc)
        try:
            fb_ns = doc.nsmap[doc.prefix]
        except Exception:
            fb_ns = FB2NS
        NAMESPACES = {'f':fb_ns, 'l':XLINK_NS}
        stylesheets = doc.xpath('//*[local-name() = "stylesheet" and @type="text/css"]')
        css = ''
        for s in stylesheets:
            css += etree.tostring(s, encoding='unicode', method='text',
                    with_tail=False) + '\n\n'
        if css:
            import css_parser, logging
            parser = css_parser.CSSParser(fetcher=None,
                    log=logging.getLogger('calibre.css'))
            XHTML_CSS_NAMESPACE = '@namespace "%s";\n' % XHTML_NS
            text = XHTML_CSS_NAMESPACE + css
            log.debug('Parsing stylesheet...')
            stylesheet = parser.parseString(text)
            stylesheet.namespaces['h'] = XHTML_NS
            css = stylesheet.cssText
            if isinstance(css, bytes):
                css = css.decode('utf-8', 'replace')
            css = css.replace('h|style', 'h|span')
            css = re.sub(r'name\s*=\s*', 'class=', css)
        self.extract_embedded_content(doc)
        log.debug('Converting XML to HTML...')
        with open(P('templates/fb2.xsl'), 'rb') as f:
            ss = f.read().decode('utf-8')
        ss = ss.replace("__FB_NS__", fb_ns)
        if options.no_inline_fb2_toc:
            log('Disabling generation of inline FB2 TOC')
            ss = re.compile(r'<!-- BUILD TOC -->.*<!-- END BUILD TOC -->',
                    re.DOTALL).sub('', ss)
        styledoc = safe_xml_fromstring(ss)
        transform = etree.XSLT(styledoc)
        result = transform(doc)
        # Handle links of type note and cite
        notes = {a.get('href')[1:]: a for a in result.xpath('//a[@link_note and @href]') if a.get('href').startswith('#')}
        cites = {a.get('link_cite'): a for a in result.xpath('//a[@link_cite]') if not a.get('href', '')}
        all_ids = {x for x in result.xpath('//*/@id')}
        for cite, a in iteritems(cites):
            note = notes.get(cite, None)
            if note:
                c = 1
                while 'cite%d' % c in all_ids:
                    c += 1
                if not note.get('id', None):
                    note.set('id', 'cite%d' % c)
                    all_ids.add(note.get('id'))
                a.set('href', '#%s' % note.get('id'))
        for x in result.xpath('//*[@link_note or @link_cite]'):
            x.attrib.pop('link_note', None)
            x.attrib.pop('link_cite', None)
        for img in result.xpath('//img[@src]'):
            src = img.get('src')
            img.set('src', self.binary_map.get(src, src))
        index = transform.tostring(result)
        with open('index.xhtml', 'wb') as f:
            f.write(index.encode('utf-8'))
        with open('inline-styles.css', 'wb') as f:
            f.write(css.encode('utf-8'))
        stream.seek(0)
        mi = get_metadata(stream, 'fb2')
        if not mi.title:
            mi.title = _('Unknown')
        if not mi.authors:
            mi.authors = [_('Unknown')]
        cpath = None
        if mi.cover_data and mi.cover_data[1]:
            with open('fb2_cover_calibre_mi.jpg', 'wb') as f:
                f.write(mi.cover_data[1])
            cpath = os.path.abspath('fb2_cover_calibre_mi.jpg')
        else:
            for img in doc.xpath('//f:coverpage/f:image', namespaces=NAMESPACES):
                href = img.get('{%s}href'%XLINK_NS, img.get('href', None))
                if href is not None:
                    if href.startswith('#'):
                        href = href[1:]
                    cpath = os.path.abspath(href)
                    break
        opf = OPFCreator(getcwd(), mi)
        entries = [(f2, guess_type(f2)[0]) for f2 in os.listdir(u'.')]
        opf.create_manifest(entries)
        opf.create_spine(['index.xhtml'])
        if cpath:
            opf.guide.set_cover(cpath)
        with open('metadata.opf', 'wb') as f:
            opf.render(f)
        return os.path.join(getcwd(), 'metadata.opf')
    def extract_embedded_content(self, doc):
        from calibre.ebooks.fb2 import base64_decode
        self.binary_map = {}
        for elem in doc.xpath('./*'):
            if elem.text and 'binary' in elem.tag and 'id' in elem.attrib:
                ct = elem.get('content-type', '')
                fname = elem.attrib['id']
                ext = ct.rpartition('/')[-1].lower()
                if ext in ('png', 'jpeg', 'jpg'):
                    if fname.lower().rpartition('.')[-1] not in {'jpg', 'jpeg',
                            'png'}:
                        fname += '.' + ext
                    self.binary_map[elem.get('id')] = fname
                raw = elem.text.strip()
                try:
                    data = base64_decode(raw)
                except TypeError:
                    self.log.exception('Binary data with id=%s is corrupted, ignoring'%(
                        elem.get('id')))
                else:
                    with open(fname, 'wb') as f:
                        f.write(data)
--- a/ebook_converter/ebooks/conversion/plugins/fb2_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/fb2_output.py
@@ -0,0 +1,203 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
 class FB2Output(OutputFormatPlugin):
    name = 'FB2 Output'
    author = 'John Schember'
    file_type = 'fb2'
    commit_name = 'fb2_output'
    FB2_GENRES = [
        # Science Fiction & Fantasy
        'sf_history',  # Alternative history
        'sf_action',  # Action
        'sf_epic',  # Epic
        'sf_heroic',  # Heroic
        'sf_detective',  # Detective
        'sf_cyberpunk',  # Cyberpunk
        'sf_space',  # Space
        'sf_social',  # Social-philosophical
        'sf_horror',  # Horror & mystic
        'sf_humor',  # Humor
        'sf_fantasy',  # Fantasy
        'sf',  # Science Fiction
        # Detectives & Thrillers
        'det_classic',  # Classical detectives
        'det_police',  # Police Stories
        'det_action',  # Action
        'det_irony',  # Ironical detectives
        'det_history',  # Historical detectives
        'det_espionage',  # Espionage detectives
        'det_crime',  # Crime detectives
        'det_political',  # Political detectives
        'det_maniac',  # Maniacs
        'det_hard',  # Hard-boiled
        'thriller',  # Thrillers
        'detective',  # Detectives
        # Prose
        'prose_classic',  # Classics prose
        'prose_history',  # Historical prose
        'prose_contemporary',  # Contemporary prose
        'prose_counter',  # Counterculture
        'prose_rus_classic',  # Russian classics prose
        'prose_su_classics',  # Soviet classics prose
        # Romance
        'love_contemporary',  # Contemporary Romance
        'love_history',  # Historical Romance
        'love_detective',  # Detective Romance
        'love_short',  # Short Romance
        'love_erotica',  # Erotica
        # Adventure
        'adv_western',  # Western
        'adv_history',  # History
        'adv_indian',  # Indians
        'adv_maritime',  # Maritime Fiction
        'adv_geo',  # Travel & geography
        'adv_animal',  # Nature & animals
        'adventure',  # Other
        # Children's
        'child_tale',  # Fairy Tales
        'child_verse',  # Verses
        'child_prose',  # Prose
        'child_sf',  # Science Fiction
        'child_det',  # Detectives & Thrillers
        'child_adv',  # Adventures
        'child_education',  # Educational
        'children',  # Other
        # Poetry & Dramaturgy
        'poetry',  # Poetry
        'dramaturgy',  # Dramaturgy
        # Antique literature
        'antique_ant',  # Antique
        'antique_european',  # European
        'antique_russian',  # Old Russian
        'antique_east',  # Old east
        'antique_myths',  # Myths. Legends. Epos
        'antique',  # Other
        # Scientific-educational
        'sci_history',  # History
        'sci_psychology',  # Psychology
        'sci_culture',  # Cultural science
        'sci_religion',  # Religious studies
        'sci_philosophy',  # Philosophy
        'sci_politics',  # Politics
        'sci_business',  # Business literature
        'sci_juris',  # Jurisprudence
        'sci_linguistic',  # Linguistics
        'sci_medicine',  # Medicine
        'sci_phys',  # Physics
        'sci_math',  # Mathematics
        'sci_chem',  # Chemistry
        'sci_biology',  # Biology
        'sci_tech',  # Technical
        'science',  # Other
        # Computers & Internet
        'comp_www',  # Internet
        'comp_programming',  # Programming
        'comp_hard',  # Hardware
        'comp_soft',  # Software
        'comp_db',  # Databases
        'comp_osnet',  # OS & Networking
        'computers',  # Other
        # Reference
        'ref_encyc',  # Encyclopedias
        'ref_dict',  # Dictionaries
        'ref_ref',  # Reference
        'ref_guide',  # Guidebooks
        'reference',  # Other
        # Nonfiction
        'nonf_biography',  # Biography & Memoirs
        'nonf_publicism',  # Publicism
        'nonf_criticism',  # Criticism
        'design',  # Art & design
        'nonfiction',  # Other
        # Religion & Inspiration
        'religion_rel',  # Religion
        'religion_esoterics',  # Esoterics
        'religion_self',  # Self-improvement
        'religion',  # Other
        # Humor
        'humor_anecdote',  # Anecdote (funny stories)
        'humor_prose',  # Prose
        'humor_verse',  # Verses
        'humor',  # Other
        # Home & Family
        'home_cooking',  # Cooking
        'home_pets',  # Pets
        'home_crafts',  # Hobbies & Crafts
        'home_entertain',  # Entertaining
        'home_health',  # Health
        'home_garden',  # Garden
        'home_diy',  # Do it yourself
        'home_sport',  # Sports
        'home_sex',  # Erotica & sex
        'home',  # Other
    ]
    ui_data = {
        'sectionize': {
            'toc': _('Section per entry in the ToC'),
            'files': _('Section per file'),
            'nothing': _('A single section')
        },
        'genres': FB2_GENRES,
    }
    options = {
        OptionRecommendation(name='sectionize',
            recommended_value='files', level=OptionRecommendation.LOW,
            choices=list(ui_data['sectionize']),
            help=_('Specify how sections are created:\n'
                ' * nothing: {nothing}\n'
                ' * files: {files}\n'
                ' * toc: {toc}\n'
                'If ToC based generation fails, adjust the "Structure detection" and/or "Table of Contents" settings '
                '(turn on "Force use of auto-generated Table of Contents").').format(**ui_data['sectionize'])
        ),
        OptionRecommendation(name='fb2_genre',
            recommended_value='antique', level=OptionRecommendation.LOW,
            choices=FB2_GENRES,
            help=(_('Genre for the book. Choices: %s\n\n See: ') % ', '.join(FB2_GENRES)
                ) + 'http://www.fictionbook.org/index.php/Eng:FictionBook_2.1_genres ' + _('for a complete list with descriptions.')),
    }
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from calibre.ebooks.oeb.transforms.jacket import linearize_jacket
        from calibre.ebooks.oeb.transforms.rasterize import SVGRasterizer, Unavailable
        from calibre.ebooks.fb2.fb2ml import FB2MLizer
        try:
            rasterizer = SVGRasterizer()
            rasterizer(oeb_book, opts)
        except Unavailable:
            log.warn('SVG rasterizer unavailable, SVG will not be converted')
        linearize_jacket(oeb_book)
        fb2mlizer = FB2MLizer(log)
        fb2_content = fb2mlizer.extract_content(oeb_book, opts)
        close = False
        if not hasattr(output_path, 'write'):
            close = True
            if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path) != '':
                os.makedirs(os.path.dirname(output_path))
            out_stream = lopen(output_path, 'wb')
        else:
            out_stream = output_path
        out_stream.seek(0)
        out_stream.truncate()
        out_stream.write(fb2_content.encode('utf-8', 'replace'))
        if close:
            out_stream.close()
--- a/ebook_converter/ebooks/conversion/plugins/html_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/html_input.py
@@ -0,0 +1,316 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2012, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import re, tempfile, os
 from functools import partial
 from calibre.constants import islinux, isbsd
 from calibre.customize.conversion import (InputFormatPlugin,
        OptionRecommendation)
 from calibre.utils.localization import get_lang
 from calibre.utils.filenames import ascii_filename
 from calibre.utils.imghdr import what
 from polyglot.builtins import unicode_type, zip, getcwd, as_unicode
 def sanitize_file_name(x):
    ans = re.sub(r'\s+', ' ', re.sub(r'[?&=;#]', '_', ascii_filename(x))).strip().rstrip('.')
    ans, ext = ans.rpartition('.')[::2]
    return (ans.strip() + '.' + ext.strip()).rstrip('.')
 class HTMLInput(InputFormatPlugin):
    name        = 'HTML Input'
    author      = 'Kovid Goyal'
    description = 'Convert HTML and OPF files to an OEB'
    file_types  = {'opf', 'html', 'htm', 'xhtml', 'xhtm', 'shtm', 'shtml'}
    commit_name = 'html_input'
    options = {
        OptionRecommendation(name='breadth_first',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Traverse links in HTML files breadth first. Normally, '
                    'they are traversed depth first.'
                   )
        ),
        OptionRecommendation(name='max_levels',
            recommended_value=5, level=OptionRecommendation.LOW,
            help=_('Maximum levels of recursion when following links in '
                   'HTML files. Must be non-negative. 0 implies that no '
                   'links in the root HTML file are followed. Default is '
                   '%default.'
                   )
        ),
        OptionRecommendation(name='dont_package',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Normally this input plugin re-arranges all the input '
                'files into a standard folder hierarchy. Only use this option '
                'if you know what you are doing as it can result in various '
                'nasty side effects in the rest of the conversion pipeline.'
                )
        ),
    }
    def convert(self, stream, opts, file_ext, log,
                accelerators):
        self._is_case_sensitive = None
        basedir = getcwd()
        self.opts = opts
        fname = None
        if hasattr(stream, 'name'):
            basedir = os.path.dirname(stream.name)
            fname = os.path.basename(stream.name)
        if file_ext != 'opf':
            if opts.dont_package:
                raise ValueError('The --dont-package option is not supported for an HTML input file')
            from calibre.ebooks.metadata.html import get_metadata
            mi = get_metadata(stream)
            if fname:
                from calibre.ebooks.metadata.meta import metadata_from_filename
                fmi = metadata_from_filename(fname)
                fmi.smart_update(mi)
                mi = fmi
            oeb = self.create_oebbook(stream.name, basedir, opts, log, mi)
            return oeb
        from calibre.ebooks.conversion.plumber import create_oebbook
        return create_oebbook(log, stream.name, opts,
                encoding=opts.input_encoding)
    def is_case_sensitive(self, path):
        if getattr(self, '_is_case_sensitive', None) is not None:
            return self._is_case_sensitive
        if not path or not os.path.exists(path):
            return islinux or isbsd
        self._is_case_sensitive = not (os.path.exists(path.lower()) and os.path.exists(path.upper()))
        return self._is_case_sensitive
    def create_oebbook(self, htmlpath, basedir, opts, log, mi):
        import uuid
        from calibre.ebooks.conversion.plumber import create_oebbook
        from calibre.ebooks.oeb.base import (DirContainer,
            rewrite_links, urlnormalize, urldefrag, BINARY_MIME, OEB_STYLES,
            xpath, urlquote)
        from calibre import guess_type
        from calibre.ebooks.oeb.transforms.metadata import \
            meta_info_to_oeb_metadata
        from calibre.ebooks.html.input import get_filelist
        from calibre.ebooks.metadata import string_to_authors
        from calibre.utils.localization import canonicalize_lang
        import css_parser, logging
        css_parser.log.setLevel(logging.WARN)
        self.OEB_STYLES = OEB_STYLES
        oeb = create_oebbook(log, None, opts, self,
                encoding=opts.input_encoding, populate=False)
        self.oeb = oeb
        metadata = oeb.metadata
        meta_info_to_oeb_metadata(mi, metadata, log)
        if not metadata.language:
            l = canonicalize_lang(getattr(opts, 'language', None))
            if not l:
                oeb.logger.warn('Language not specified')
                l = get_lang().replace('_', '-')
            metadata.add('language', l)
        if not metadata.creator:
            a = getattr(opts, 'authors', None)
            if a:
                a = string_to_authors(a)
            if not a:
                oeb.logger.warn('Creator not specified')
                a = [self.oeb.translate(__('Unknown'))]
            for aut in a:
                metadata.add('creator', aut)
        if not metadata.title:
            oeb.logger.warn('Title not specified')
            metadata.add('title', self.oeb.translate(__('Unknown')))
        bookid = unicode_type(uuid.uuid4())
        metadata.add('identifier', bookid, id='uuid_id', scheme='uuid')
        for ident in metadata.identifier:
            if 'id' in ident.attrib:
                self.oeb.uid = ident
                break
        filelist = get_filelist(htmlpath, basedir, opts, log)
        filelist = [f for f in filelist if not f.is_binary]
        htmlfile_map = {}
        for f in filelist:
            path = f.path
            oeb.container = DirContainer(os.path.dirname(path), log,
                    ignore_opf=True)
            bname = os.path.basename(path)
            id, href = oeb.manifest.generate(id='html', href=sanitize_file_name(bname))
            htmlfile_map[path] = href
            item = oeb.manifest.add(id, href, 'text/html')
            if path == htmlpath and '%' in path:
                bname = urlquote(bname)
            item.html_input_href = bname
            oeb.spine.add(item, True)
        self.added_resources = {}
        self.log = log
        self.log('Normalizing filename cases')
        for path, href in htmlfile_map.items():
            if not self.is_case_sensitive(path):
                path = path.lower()
            self.added_resources[path] = href
        self.urlnormalize, self.DirContainer = urlnormalize, DirContainer
        self.urldefrag = urldefrag
        self.guess_type, self.BINARY_MIME = guess_type, BINARY_MIME
        self.log('Rewriting HTML links')
        for f in filelist:
            path = f.path
            dpath = os.path.dirname(path)
            oeb.container = DirContainer(dpath, log, ignore_opf=True)
            href = htmlfile_map[path]
            try:
                item = oeb.manifest.hrefs[href]
            except KeyError:
                item = oeb.manifest.hrefs[urlnormalize(href)]
            rewrite_links(item.data, partial(self.resource_adder, base=dpath))
        for item in oeb.manifest.values():
            if item.media_type in self.OEB_STYLES:
                dpath = None
                for path, href in self.added_resources.items():
                    if href == item.href:
                        dpath = os.path.dirname(path)
                        break
                css_parser.replaceUrls(item.data,
                        partial(self.resource_adder, base=dpath))
        toc = self.oeb.toc
        self.oeb.auto_generated_toc = True
        titles = []
        headers = []
        for item in self.oeb.spine:
            if not item.linear:
                continue
            html = item.data
            title = ''.join(xpath(html, '/h:html/h:head/h:title/text()'))
            title = re.sub(r'\s+', ' ', title.strip())
            if title:
                titles.append(title)
            headers.append('(unlabeled)')
            for tag in ('h1', 'h2', 'h3', 'h4', 'h5', 'strong'):
                expr = '/h:html/h:body//h:%s[position()=1]/text()'
                header = ''.join(xpath(html, expr % tag))
                header = re.sub(r'\s+', ' ', header.strip())
                if header:
                    headers[-1] = header
                    break
        use = titles
        if len(titles) > len(set(titles)):
            use = headers
        for title, item in zip(use, self.oeb.spine):
            if not item.linear:
                continue
            toc.add(title, item.href)
        oeb.container = DirContainer(getcwd(), oeb.log, ignore_opf=True)
        return oeb
    def link_to_local_path(self, link_, base=None):
        from calibre.ebooks.html.input import Link
        if not isinstance(link_, unicode_type):
            try:
                link_ = link_.decode('utf-8', 'strict')
            except:
                self.log.warn('Failed to decode link %r. Ignoring'%link_)
                return None, None
        try:
            l = Link(link_, base if base else getcwd())
        except:
            self.log.exception('Failed to process link: %r'%link_)
            return None, None
        if l.path is None:
            # Not a local resource
            return None, None
        link = l.path.replace('/', os.sep).strip()
        frag = l.fragment
        if not link:
            return None, None
        return link, frag
    def resource_adder(self, link_, base=None):
        from polyglot.urllib import quote
        link, frag = self.link_to_local_path(link_, base=base)
        if link is None:
            return link_
        try:
            if base and not os.path.isabs(link):
                link = os.path.join(base, link)
            link = os.path.abspath(link)
        except:
            return link_
        if not os.access(link, os.R_OK):
            return link_
        if os.path.isdir(link):
            self.log.warn(link_, 'is a link to a directory. Ignoring.')
            return link_
        if not self.is_case_sensitive(tempfile.gettempdir()):
            link = link.lower()
        if link not in self.added_resources:
            bhref = os.path.basename(link)
            id, href = self.oeb.manifest.generate(id='added', href=sanitize_file_name(bhref))
            guessed = self.guess_type(href)[0]
            media_type = guessed or self.BINARY_MIME
            if media_type == 'text/plain':
                self.log.warn('Ignoring link to text file %r'%link_)
                return None
            if media_type == self.BINARY_MIME:
                # Check for the common case, images
                try:
                    img = what(link)
                except EnvironmentError:
                    pass
                else:
                    if img:
                        media_type = self.guess_type('dummy.'+img)[0] or self.BINARY_MIME
            self.oeb.log.debug('Added', link)
            self.oeb.container = self.DirContainer(os.path.dirname(link),
                    self.oeb.log, ignore_opf=True)
            # Load into memory
            item = self.oeb.manifest.add(id, href, media_type)
            # bhref refers to an already existing file. The read() method of
            # DirContainer will call unquote on it before trying to read the
            # file, therefore we quote it here.
            if isinstance(bhref, unicode_type):
                bhref = bhref.encode('utf-8')
            item.html_input_href = as_unicode(quote(bhref))
            if guessed in self.OEB_STYLES:
                item.override_css_fetch = partial(
                        self.css_import_handler, os.path.dirname(link))
            item.data
            self.added_resources[link] = href
        nlink = self.added_resources[link]
        if frag:
            nlink = '#'.join((nlink, frag))
        return nlink
    def css_import_handler(self, base, href):
        link, frag = self.link_to_local_path(href, base=base)
        if link is None or not os.access(link, os.R_OK) or os.path.isdir(link):
            return (None, None)
        try:
            with open(link, 'rb') as f:
                raw = f.read().decode('utf-8', 'replace')
            raw = self.oeb.css_preprocessor(raw, add_namespace=False)
        except:
            self.log.exception('Failed to read CSS file: %r'%link)
            return (None, None)
        return (None, raw)
--- a/ebook_converter/ebooks/conversion/plugins/html_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/html_output.py
@@ -0,0 +1,226 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2010, Fabian Grassl <fg@jusmeum.de>'
 __docformat__ = 'restructuredtext en'
 import os, re, shutil
 from os.path import dirname, abspath, relpath as _relpath, exists, basename
 from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
 from calibre import CurrentDir
 from calibre.ptempfile import PersistentTemporaryDirectory
 from polyglot.builtins import unicode_type
 def relpath(*args):
    return _relpath(*args).replace(os.sep, '/')
 class HTMLOutput(OutputFormatPlugin):
    name = 'HTML Output'
    author = 'Fabian Grassl'
    file_type = 'zip'
    commit_name = 'html_output'
    options = {
        OptionRecommendation(name='template_css',
            help=_('CSS file used for the output instead of the default file')),
        OptionRecommendation(name='template_html_index',
            help=_('Template used for generation of the HTML index file instead of the default file')),
        OptionRecommendation(name='template_html',
            help=_('Template used for the generation of the HTML contents of the book instead of the default file')),
        OptionRecommendation(name='extract_to',
            help=_('Extract the contents of the generated ZIP file to the '
                'specified directory. WARNING: The contents of the directory '
                'will be deleted.')
        ),
    }
    recommendations = {('pretty_print', True, OptionRecommendation.HIGH)}
    def generate_toc(self, oeb_book, ref_url, output_dir):
        '''
        Generate table of contents
        '''
        from lxml import etree
        from polyglot.urllib import unquote
        from calibre.ebooks.oeb.base import element
        from calibre.utils.cleantext import clean_xml_chars
        with CurrentDir(output_dir):
            def build_node(current_node, parent=None):
                if parent is None:
                    parent = etree.Element('ul')
                elif len(current_node.nodes):
                    parent = element(parent, 'ul')
                for node in current_node.nodes:
                    point = element(parent, 'li')
                    href = relpath(abspath(unquote(node.href)), dirname(ref_url))
                    if isinstance(href, bytes):
                        href = href.decode('utf-8')
                    link = element(point, 'a', href=clean_xml_chars(href))
                    title = node.title
                    if isinstance(title, bytes):
                        title = title.decode('utf-8')
                    if title:
                        title = re.sub(r'\s+', ' ', title)
                    link.text = clean_xml_chars(title)
                    build_node(node, point)
                return parent
            wrap = etree.Element('div')
            wrap.append(build_node(oeb_book.toc))
            return wrap
    def generate_html_toc(self, oeb_book, ref_url, output_dir):
        from lxml import etree
        root = self.generate_toc(oeb_book, ref_url, output_dir)
        return etree.tostring(root, pretty_print=True, encoding='unicode',
                xml_declaration=False)
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from lxml import etree
        from calibre.utils import zipfile
        from templite import Templite
        from polyglot.urllib import unquote
        from calibre.ebooks.html.meta import EasyMeta
        # read template files
        if opts.template_html_index is not None:
            with open(opts.template_html_index, 'rb') as f:
                template_html_index_data = f.read()
        else:
            template_html_index_data = P('templates/html_export_default_index.tmpl', data=True)
        if opts.template_html is not None:
            with open(opts.template_html, 'rb') as f:
                template_html_data = f.read()
        else:
            template_html_data = P('templates/html_export_default.tmpl', data=True)
        if opts.template_css is not None:
            with open(opts.template_css, 'rb') as f:
                template_css_data = f.read()
        else:
            template_css_data = P('templates/html_export_default.css', data=True)
        template_html_index_data = template_html_index_data.decode('utf-8')
        template_html_data = template_html_data.decode('utf-8')
        template_css_data = template_css_data.decode('utf-8')
        self.log  = log
        self.opts = opts
        meta = EasyMeta(oeb_book.metadata)
        tempdir = os.path.realpath(PersistentTemporaryDirectory())
        # Anchor the patterns so only a trailing extension is stripped.
        output_file = os.path.join(tempdir,
                basename(re.sub(r'\.zip$', '', output_path)+'.html'))
        output_dir = re.sub(r'\.html$', '', output_file)+'_files'
        if not exists(output_dir):
            os.makedirs(output_dir)
        css_path = output_dir+os.sep+'calibreHtmlOutBasicCss.css'
        with open(css_path, 'wb') as f:
            f.write(template_css_data.encode('utf-8'))
        with open(output_file, 'wb') as f:
            html_toc = self.generate_html_toc(oeb_book, output_file, output_dir)
            templite = Templite(template_html_index_data)
            nextLink = oeb_book.spine[0].href
            nextLink = relpath(output_dir+os.sep+nextLink, dirname(output_file))
            cssLink = relpath(abspath(css_path), dirname(output_file))
            tocUrl = relpath(output_file, dirname(output_file))
            t = templite.render(has_toc=bool(oeb_book.toc.count()),
                    toc=html_toc, meta=meta, nextLink=nextLink,
                    tocUrl=tocUrl, cssLink=cssLink,
                    firstContentPageLink=nextLink)
            if isinstance(t, unicode_type):
                t = t.encode('utf-8')
            f.write(t)
        with CurrentDir(output_dir):
            for item in oeb_book.manifest:
                path = abspath(unquote(item.href))
                dir = dirname(path)
                if not exists(dir):
                    os.makedirs(dir)
                if item.spine_position is not None:
                    with open(path, 'wb') as f:
                        pass
                else:
                    with open(path, 'wb') as f:
                        f.write(item.bytes_representation)
                    item.unload_data_from_memory(memory=path)
            for item in oeb_book.spine:
                path = abspath(unquote(item.href))
                dir = dirname(path)
                root = item.data.getroottree()
                # get & clean HTML <HEAD>-data
                head = root.xpath('//h:head', namespaces={'h': 'http://www.w3.org/1999/xhtml'})[0]
                head_content = etree.tostring(head, pretty_print=True, encoding='unicode')
                head_content = re.sub(r'\<\/?head.*\>', '', head_content)
                head_content = re.sub(re.compile(r'\<style.*\/style\>', re.M|re.S), '', head_content)
                head_content = re.sub(r'<(title)([^>]*)/>', r'<\1\2></\1>', head_content)
                # get & clean HTML <BODY>-data
                body = root.xpath('//h:body', namespaces={'h': 'http://www.w3.org/1999/xhtml'})[0]
                ebook_content = etree.tostring(body, pretty_print=True, encoding='unicode')
                ebook_content = re.sub(r'\<\/?body.*\>', '', ebook_content)
                ebook_content = re.sub(r'<(div|a|span)([^>]*)/>', r'<\1\2></\1>', ebook_content)
                # generate link to next page
                if item.spine_position+1 < len(oeb_book.spine):
                    nextLink = oeb_book.spine[item.spine_position+1].href
                    nextLink = relpath(abspath(nextLink), dir)
                else:
                    nextLink = None
                # generate link to previous page
                if item.spine_position > 0:
                    prevLink = oeb_book.spine[item.spine_position-1].href
                    prevLink = relpath(abspath(prevLink), dir)
                else:
                    prevLink = None
                cssLink = relpath(abspath(css_path), dir)
                tocUrl = relpath(output_file, dir)
                firstContentPageLink = oeb_book.spine[0].href
                # render template
                templite = Templite(template_html_data)
                toc = lambda: self.generate_html_toc(oeb_book, path, output_dir)
                t = templite.render(ebookContent=ebook_content,
                        prevLink=prevLink, nextLink=nextLink,
                        has_toc=bool(oeb_book.toc.count()), toc=toc,
                        tocUrl=tocUrl, head_content=head_content,
                        meta=meta, cssLink=cssLink,
                        firstContentPageLink=firstContentPageLink)
                # write html to file
                with open(path, 'wb') as f:
                    f.write(t.encode('utf-8'))
                item.unload_data_from_memory(memory=path)
        zfile = zipfile.ZipFile(output_path, "w")
        zfile.add_dir(output_dir, basename(output_dir))
        zfile.write(output_file, basename(output_file), zipfile.ZIP_DEFLATED)
        if opts.extract_to:
            if os.path.exists(opts.extract_to):
                shutil.rmtree(opts.extract_to)
            os.makedirs(opts.extract_to)
            zfile.extractall(opts.extract_to)
            self.log('Zip file extracted to', opts.extract_to)
        zfile.close()
        # cleanup temp dir
        shutil.rmtree(tempdir)
--- a/ebook_converter/ebooks/conversion/plugins/htmlz_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/htmlz_input.py
@@ -0,0 +1,133 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2011, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre import guess_type
 from calibre.customize.conversion import InputFormatPlugin
 from polyglot.builtins import getcwd
 class HTMLZInput(InputFormatPlugin):
    name        = 'HTLZ Input'
    author      = 'John Schember'
    description = 'Convert HTMLZ files to HTML'
    file_types  = {'htmlz'}
    commit_name = 'htmlz_input'
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.chardet import xml_to_unicode
        from calibre.ebooks.metadata.opf2 import OPF
        from calibre.utils.zipfile import ZipFile
        self.log = log
        html = u''
        top_levels = []
        # Extract content from zip archive.
        zf = ZipFile(stream)
        zf.extractall()
        # Find the HTML file in the archive. It needs to be
        # top level.
        index = u''
        multiple_html = False
        # Get a list of all top level files in the archive.
        for x in os.listdir(u'.'):
            if os.path.isfile(x):
                top_levels.append(x)
        # Try to find an index file.
        for x in top_levels:
            if x.lower() in (u'index.html', u'index.xhtml', u'index.htm'):
                index = x
                break
        # Look for multiple HTML files in the archive. We look at the
        # top level files only as only they matter in HTMLZ.
        for x in top_levels:
            if os.path.splitext(x)[1].lower() in (u'.html', u'.xhtml', u'.htm'):
                # Set index to the first HTML file found if it's not
                # called index.
                if not index:
                    index = x
                else:
                    multiple_html = True
        # Warn the user if there are multiple HTML files in the archive. HTMLZ
        # supports a single HTML file. Converting an HTMLZ archive that holds
        # several HTML files probably won't turn out as the user expects; in
        # that case ZIP input should be used instead of HTMLZ.
        if multiple_html:
            log.warn(_('Multiple HTML files found in the archive. Only %s will be used.') % index)
        if index:
            with open(index, 'rb') as tf:
                html = tf.read()
        else:
            raise Exception(_('No top level HTML file found.'))
        if not html:
            raise Exception(_('Top level HTML file %s is empty') % index)
        # Encoding
        if options.input_encoding:
            ienc = options.input_encoding
        else:
            ienc = xml_to_unicode(html[:4096])[-1]
        html = html.decode(ienc, 'replace')
        # Run the HTML through the html processing plugin.
        from calibre.customize.ui import plugin_for_input_format
        html_input = plugin_for_input_format('html')
        for opt in html_input.options:
            setattr(options, opt.option.name, opt.recommended_value)
        options.input_encoding = 'utf-8'
        base = getcwd()
        htmlfile = os.path.join(base, u'index.html')
        c = 0
        while os.path.exists(htmlfile):
            c += 1
            htmlfile = os.path.join(base, u'index%d.html' % c)
        with open(htmlfile, 'wb') as f:
            f.write(html.encode('utf-8'))
        odi = options.debug_pipeline
        options.debug_pipeline = None
        # Generate oeb from html conversion.
        with open(htmlfile, 'rb') as f:
            oeb = html_input.convert(f, options, 'html', log,
                {})
        options.debug_pipeline = odi
        os.remove(htmlfile)
        # Set metadata from file.
        from calibre.customize.ui import get_file_type_metadata
        from calibre.ebooks.oeb.transforms.metadata import meta_info_to_oeb_metadata
        mi = get_file_type_metadata(stream, file_ext)
        meta_info_to_oeb_metadata(mi, oeb.metadata, log)
        # Get the cover path from the OPF.
        cover_path = None
        opf = None
        for x in top_levels:
            if os.path.splitext(x)[1].lower() == u'.opf':
                opf = x
                break
        if opf:
            opf = OPF(opf, basedir=getcwd())
            cover_path = opf.raster_cover or opf.cover
        # Set the cover.
        if cover_path:
            cdata = None
            with open(os.path.join(getcwd(), cover_path), 'rb') as cf:
                cdata = cf.read()
            cover_name = os.path.basename(cover_path)
            id, href = oeb.manifest.generate('cover', cover_name)
            oeb.manifest.add(id, href, guess_type(cover_name)[0], data=cdata)
            oeb.guide.add('cover', 'Cover', href)
        return oeb
--- a/ebook_converter/ebooks/conversion/plugins/htmlz_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/htmlz_output.py
@@ -0,0 +1,136 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2011, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import io
 import os
 from calibre.customize.conversion import OutputFormatPlugin, \
    OptionRecommendation
 from calibre.ptempfile import TemporaryDirectory
 from polyglot.builtins import unicode_type
 class HTMLZOutput(OutputFormatPlugin):
    name = 'HTMLZ Output'
    author = 'John Schember'
    file_type = 'htmlz'
    commit_name = 'htmlz_output'
    ui_data = {
            'css_choices': {
                'class': _('Use CSS classes'),
                'inline': _('Use the style attribute'),
                'tag': _('Use HTML tags wherever possible')
            },
            'sheet_choices': {
                'external': _('Use an external CSS file'),
                'inline': _('Use a <style> tag in the HTML file')
            }
    }
    options = {
        OptionRecommendation(name='htmlz_css_type', recommended_value='class',
            level=OptionRecommendation.LOW,
            choices=list(ui_data['css_choices']),
            help=_('Specify the handling of CSS. Default is class.\n'
                   'class: {class}\n'
                   'inline: {inline}\n'
                   'tag: {tag}'
            ).format(**ui_data['css_choices'])),
        OptionRecommendation(name='htmlz_class_style', recommended_value='external',
            level=OptionRecommendation.LOW,
            choices=list(ui_data['sheet_choices']),
            help=_('How to handle the CSS when using css-type = \'class\'.\n'
                   'Default is external.\n'
                   'external: {external}\n'
                   'inline: {inline}'
            ).format(**ui_data['sheet_choices'])),
        OptionRecommendation(name='htmlz_title_filename',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('If set this option causes the file name of the HTML file'
                ' inside the HTMLZ archive to be based on the book title.')
            ),
    }
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from lxml import etree
        from calibre.ebooks.oeb.base import OEB_IMAGES, SVG_MIME
        from calibre.ebooks.metadata.opf2 import OPF, metadata_to_opf
        from calibre.utils.zipfile import ZipFile
        from calibre.utils.filenames import ascii_filename
        # HTML
        if opts.htmlz_css_type == 'inline':
            from calibre.ebooks.htmlz.oeb2html import OEB2HTMLInlineCSSizer
            OEB2HTMLizer = OEB2HTMLInlineCSSizer
        elif opts.htmlz_css_type == 'tag':
            from calibre.ebooks.htmlz.oeb2html import OEB2HTMLNoCSSizer
            OEB2HTMLizer = OEB2HTMLNoCSSizer
        else:
            from calibre.ebooks.htmlz.oeb2html import OEB2HTMLClassCSSizer as OEB2HTMLizer
        with TemporaryDirectory(u'_htmlz_output') as tdir:
            htmlizer = OEB2HTMLizer(log)
            html = htmlizer.oeb2html(oeb_book, opts)
            fname = u'index'
            if opts.htmlz_title_filename:
                from calibre.utils.filenames import shorten_components_to
                fname = shorten_components_to(100, (ascii_filename(unicode_type(oeb_book.metadata.title[0])),))[0]
            with open(os.path.join(tdir, fname+u'.html'), 'wb') as tf:
                if isinstance(html, unicode_type):
                    html = html.encode('utf-8')
                tf.write(html)
            # CSS
            if opts.htmlz_css_type == 'class' and opts.htmlz_class_style == 'external':
                with open(os.path.join(tdir, u'style.css'), 'wb') as tf:
                    tf.write(htmlizer.get_css(oeb_book))
            # Images
            images = htmlizer.images
            if images:
                if not os.path.exists(os.path.join(tdir, u'images')):
                    os.makedirs(os.path.join(tdir, u'images'))
                for item in oeb_book.manifest:
                    if item.media_type in OEB_IMAGES and item.href in images:
                        if item.media_type == SVG_MIME:
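                            # The image file is opened in binary mode below,
                            # so serialize the SVG tree to bytes.
                            data = etree.tostring(item.data, encoding='unicode').encode('utf-8')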
                        else:
                            data = item.data
                        fname = os.path.join(tdir, u'images', images[item.href])
                        with open(fname, 'wb') as img:
                            img.write(data)
            # Cover
            cover_path = None
            try:
                cover_data = None
                if oeb_book.metadata.cover:
                    term = oeb_book.metadata.cover[0].term
                    cover_data = oeb_book.guide[term].item.data
                if cover_data:
                    from calibre.utils.img import save_cover_data_to
                    cover_path = os.path.join(tdir, u'cover.jpg')
                    with lopen(cover_path, 'w') as cf:
                        cf.write('')
                    save_cover_data_to(cover_data, cover_path)
            except Exception:
                import traceback
                traceback.print_exc()
            # Metadata
            with open(os.path.join(tdir, u'metadata.opf'), 'wb') as mdataf:
                opf = OPF(io.BytesIO(etree.tostring(oeb_book.metadata.to_opf1(), encoding='UTF-8')))
                mi = opf.to_book_metadata()
                if cover_path:
                    mi.cover = u'cover.jpg'
                mdataf.write(metadata_to_opf(mi))
            htmlz = ZipFile(output_path, 'w')
            htmlz.add_dir(tdir)
            # Close explicitly rather than relying on garbage collection.
            htmlz.close()
--- a/ebook_converter/ebooks/conversion/plugins/lit_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/lit_input.py
@@ -0,0 +1,64 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 from calibre.customize.conversion import InputFormatPlugin
 class LITInput(InputFormatPlugin):
    name        = 'LIT Input'
    author      = 'Marshall T. Vandegrift'
    description = 'Convert LIT files to HTML'
    file_types  = {'lit'}
    commit_name = 'lit_input'
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.lit.reader import LitReader
        from calibre.ebooks.conversion.plumber import create_oebbook
        self.log = log
        return create_oebbook(log, stream, options, reader=LitReader)
    def postprocess_book(self, oeb, opts, log):
        from calibre.ebooks.oeb.base import XHTML_NS, XPath, XHTML
        for item in oeb.spine:
            root = item.data
            if not hasattr(root, 'xpath'):
                continue
            for bad in ('metadata', 'guide'):
                metadata = XPath('//h:'+bad)(root)
                if metadata:
                    for x in metadata:
                        x.getparent().remove(x)
            body = XPath('//h:body')(root)
            if body:
                body = body[0]
                if len(body) == 1 and body[0].tag == XHTML('pre'):
                    pre = body[0]
                    from calibre.ebooks.txt.processor import convert_basic, \
                        separate_paragraphs_single_line
                    from calibre.ebooks.chardet import xml_to_unicode
                    from calibre.utils.xml_parse import safe_xml_fromstring
                    import copy
                    self.log('LIT file with all text in single <pre> tag detected')
                    html = separate_paragraphs_single_line(pre.text)
                    html = convert_basic(html).replace('<html>',
                            '<html xmlns="%s">'%XHTML_NS)
                    html = xml_to_unicode(html, strip_encoding_pats=True,
                            resolve_entities=True)[0]
                    if opts.smarten_punctuation:
                        # SmartyPants skips text inside <pre> tags
                        from calibre.ebooks.conversion.preprocess import smarten_punctuation
                        html = smarten_punctuation(html, self.log)
                    root = safe_xml_fromstring(html)
                    body = XPath('//h:body')(root)
                    pre.tag = XHTML('div')
                    pre.text = ''
                    for elem in body:
                        ne = copy.deepcopy(elem)
                        pre.append(ne)
--- a/ebook_converter/ebooks/conversion/plugins/lit_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/lit_output.py
@@ -0,0 +1,38 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 from calibre.customize.conversion import OutputFormatPlugin
 class LITOutput(OutputFormatPlugin):
    name = 'LIT Output'
    author = 'Marshall T. Vandegrift'
    file_type = 'lit'
    commit_name = 'lit_output'
    def convert(self, oeb, output_path, input_plugin, opts, log):
        self.log, self.opts, self.oeb = log, opts, oeb
        from calibre.ebooks.oeb.transforms.manglecase import CaseMangler
        from calibre.ebooks.oeb.transforms.rasterize import SVGRasterizer
        from calibre.ebooks.oeb.transforms.htmltoc import HTMLTOCAdder
        from calibre.ebooks.lit.writer import LitWriter
        from calibre.ebooks.oeb.transforms.split import Split
        split = Split(split_on_page_breaks=True, max_flow_size=0,
                remove_css_pagebreaks=False)
        split(self.oeb, self.opts)
        tocadder = HTMLTOCAdder()
        tocadder(oeb, opts)
        mangler = CaseMangler()
        mangler(oeb, opts)
        rasterizer = SVGRasterizer()
        rasterizer(oeb, opts)
        lit = LitWriter(self.opts)
        lit(oeb, output_path)
--- a/ebook_converter/ebooks/conversion/plugins/lrf_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/lrf_input.py
@@ -0,0 +1,82 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import os, sys
 from calibre.customize.conversion import InputFormatPlugin
 class LRFInput(InputFormatPlugin):
    name        = 'LRF Input'
    author      = 'Kovid Goyal'
    description = 'Convert LRF files to HTML'
    file_types  = {'lrf'}
    commit_name = 'lrf_input'
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.lrf.input import (MediaType, Styles, TextBlock,
                Canvas, ImageBlock, RuledLine)
        self.log = log
        self.log('Generating XML')
        from calibre.ebooks.lrf.lrfparser import LRFDocument
        from calibre.utils.xml_parse import safe_xml_fromstring
        from lxml import etree
        d = LRFDocument(stream)
        d.parse()
        xml = d.to_xml(write_files=True)
        if options.verbose > 2:
            with open(u'lrs.xml', 'wb') as f:
                f.write(xml.encode('utf-8'))
        doc = safe_xml_fromstring(xml)
        char_button_map = {}
        for x in doc.xpath('//CharButton[@refobj]'):
            ro = x.get('refobj')
            jump_button = doc.xpath('//*[@objid="%s"]'%ro)
            if jump_button:
                jump_to = jump_button[0].xpath('descendant::JumpTo[@refpage and @refobj]')
                if jump_to:
                    char_button_map[ro] = '%s.xhtml#%s'%(jump_to[0].get('refpage'),
                            jump_to[0].get('refobj'))
        plot_map = {}
        for x in doc.xpath('//Plot[@refobj]'):
            ro = x.get('refobj')
            image = doc.xpath('//Image[@objid="%s" and @refstream]'%ro)
            if image:
                imgstr = doc.xpath('//ImageStream[@objid="%s" and @file]'%
                    image[0].get('refstream'))
                if imgstr:
                    plot_map[ro] = imgstr[0].get('file')
        self.log('Converting XML to HTML...')
        styledoc = safe_xml_fromstring(P('templates/lrf.xsl', data=True))
        media_type = MediaType()
        styles = Styles()
        text_block = TextBlock(styles, char_button_map, plot_map, log)
        canvas = Canvas(doc, styles, text_block, log)
        image_block = ImageBlock(canvas)
        ruled_line = RuledLine()
        extensions = {
                ('calibre', 'media-type') : media_type,
                ('calibre', 'text-block') : text_block,
                ('calibre', 'ruled-line') : ruled_line,
                ('calibre', 'styles')     : styles,
                ('calibre', 'canvas')     : canvas,
                ('calibre', 'image-block'): image_block,
                }
        transform = etree.XSLT(styledoc, extensions=extensions)
        try:
            result = transform(doc)
        except RuntimeError:
            sys.setrecursionlimit(5000)
            result = transform(doc)
        with open('content.opf', 'wb') as f:
            f.write(result)
        styles.write()
        return os.path.abspath('content.opf')
--- a/ebook_converter/ebooks/conversion/plugins/lrf_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/lrf_output.py
@@ -0,0 +1,196 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import sys, os
 from calibre.customize.conversion import OutputFormatPlugin
 from calibre.customize.conversion import OptionRecommendation
 from polyglot.builtins import unicode_type
 class LRFOptions(object):
    def __init__(self, output, opts, oeb):
        def f2s(f):
            try:
                return unicode_type(f[0])
            except Exception:
                return ''
        m = oeb.metadata
        for x in ('left', 'top', 'right', 'bottom'):
            attr = 'margin_'+x
            val = getattr(opts, attr)
            if val < 0:
                setattr(opts, attr, 0)
        self.title = None
        self.author = self.publisher = _('Unknown')
        self.title_sort = self.author_sort = ''
        for x in m.creator:
            if x.role == 'aut':
                self.author = unicode_type(x)
                fa = unicode_type(getattr(x, 'file_as', ''))
                if fa:
                    self.author_sort = fa
        for x in m.title:
            if unicode_type(x.file_as):
                self.title_sort = unicode_type(x.file_as)
        self.freetext = f2s(m.description)
        self.category = f2s(m.subject)
        self.cover = None
        self.use_metadata_cover = True
        self.output = output
        self.ignore_tables = opts.linearize_tables
        if opts.disable_font_rescaling:
            self.base_font_size = 0
        else:
            self.base_font_size = opts.base_font_size
        self.blank_after_para = opts.insert_blank_line
        self.use_spine = True
        self.font_delta = 0
        self.ignore_colors = False
        from calibre.ebooks.lrf import PRS500_PROFILE
        self.profile = PRS500_PROFILE
        self.link_levels = sys.maxsize
        self.link_exclude = '@'
        self.no_links_in_toc = True
        self.disable_chapter_detection = True
        self.chapter_regex = 'dsadcdswcdec'
        self.chapter_attr = '$,,$'
        self.override_css = self._override_css = ''
        self.page_break = 'h[12]'
        self.force_page_break = '$'
        self.force_page_break_attr = '$'
        self.add_chapters_to_toc = False
        self.baen = self.pdftohtml = self.book_designer = False
        self.verbose = opts.verbose
        self.encoding = 'utf-8'
        self.lrs = False
        self.minimize_memory_usage = False
        self.autorotation = opts.enable_autorotation
        self.header_separation = (self.profile.dpi/72.) * opts.header_separation
        self.headerformat = opts.header_format
        for x in ('top', 'bottom', 'left', 'right'):
            setattr(self, x+'_margin',
                (self.profile.dpi/72.) * float(getattr(opts, 'margin_'+x)))
        for x in ('wordspace', 'header', 'header_format',
                'minimum_indent', 'serif_family',
                'render_tables_as_images', 'sans_family', 'mono_family',
                'text_size_multiplier_for_rendered_tables'):
            setattr(self, x, getattr(opts, x))
 class LRFOutput(OutputFormatPlugin):
    name = 'LRF Output'
    author = 'Kovid Goyal'
    file_type = 'lrf'
    commit_name = 'lrf_output'
    options = {
        OptionRecommendation(name='enable_autorotation', recommended_value=False,
            help=_('Enable auto-rotation of images that are wider than the screen width.')
        ),
        OptionRecommendation(name='wordspace',
            recommended_value=2.5, level=OptionRecommendation.LOW,
            help=_('Set the space between words in pts. Default is %default')
        ),
        OptionRecommendation(name='header', recommended_value=False,
            help=_('Add a header to all the pages with title and author.')
        ),
        OptionRecommendation(name='header_format', recommended_value="%t by %a",
            help=_('Set the format of the header. %a is replaced by the author '
            'and %t by the title. Default is %default')
        ),
        OptionRecommendation(name='header_separation', recommended_value=0,
            help=_('Add extra spacing below the header. Default is %default pt.')
        ),
        OptionRecommendation(name='minimum_indent', recommended_value=0,
            help=_('Minimum paragraph indent (the indent of the first line '
            'of a paragraph) in pts. Default: %default')
        ),
        OptionRecommendation(name='render_tables_as_images',
            recommended_value=False,
            help=_('This option has no effect')
        ),
        OptionRecommendation(name='text_size_multiplier_for_rendered_tables',
            recommended_value=1.0,
            help=_('Multiply the size of text in rendered tables by this '
            'factor. Default is %default')
        ),
        OptionRecommendation(name='serif_family', recommended_value=None,
            help=_('The serif family of fonts to embed')
        ),
        OptionRecommendation(name='sans_family', recommended_value=None,
            help=_('The sans-serif family of fonts to embed')
        ),
        OptionRecommendation(name='mono_family', recommended_value=None,
            help=_('The monospace family of fonts to embed')
        ),
    }
    recommendations = {
        ('change_justification', 'original', OptionRecommendation.HIGH)}
    def convert_images(self, pages, opts, wide):
        from calibre.ebooks.lrf.pylrs.pylrs import Book, BookSetting, ImageStream, ImageBlock
        from uuid import uuid4
        from calibre.constants import __appname__, __version__
        width, height = (784, 1012) if wide else (584, 754)
        ps = {}
        ps['topmargin']      = 0
        ps['evensidemargin'] = 0
        ps['oddsidemargin']  = 0
        ps['textwidth']      = width
        ps['textheight']     = height
        book = Book(title=opts.title, author=opts.author,
                bookid=uuid4().hex,
                publisher='%s %s'%(__appname__, __version__),
                category=_('Comic'), pagestyledefault=ps,
                booksetting=BookSetting(screenwidth=width, screenheight=height))
        for page in pages:
            imageStream = ImageStream(page)
            _page = book.create_page()
            _page.append(ImageBlock(refstream=imageStream,
                        blockwidth=width, blockheight=height, xsize=width,
                        ysize=height, x1=width, y1=height))
            book.append(_page)
        # Close the output file deterministically once rendering finishes
        with open(opts.output, 'wb') as f:
            book.renderLrf(f)
    def flatten_toc(self):
        from calibre.ebooks.oeb.base import TOC
        nroot = TOC()
        for x in self.oeb.toc.iterdescendants():
            nroot.add(x.title, x.href)
        self.oeb.toc = nroot
    def convert(self, oeb, output_path, input_plugin, opts, log):
        self.log, self.opts, self.oeb = log, opts, oeb
        lrf_opts = LRFOptions(output_path, opts, oeb)
        if input_plugin.is_image_collection:
            self.convert_images(input_plugin.get_images(), lrf_opts,
                    getattr(opts, 'wide', False))
            return
        self.flatten_toc()
        from calibre.ptempfile import TemporaryDirectory
        with TemporaryDirectory('_lrf_output') as tdir:
            from calibre.customize.ui import plugin_for_output_format
            oeb_output = plugin_for_output_format('oeb')
            oeb_output.convert(oeb, tdir, input_plugin, opts, log)
            opf = [x for x in os.listdir(tdir) if x.endswith('.opf')][0]
            from calibre.ebooks.lrf.html.convert_from import process_file
            process_file(os.path.join(tdir, opf), lrf_opts, self.log)
--- a/ebook_converter/ebooks/conversion/plugins/mobi_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/mobi_input.py
@@ -0,0 +1,66 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import InputFormatPlugin
 from polyglot.builtins import unicode_type
 class MOBIInput(InputFormatPlugin):
    name        = 'MOBI Input'
    author      = 'Kovid Goyal'
    description = 'Convert MOBI files (.mobi, .prc, .azw) to HTML'
    file_types  = {'mobi', 'prc', 'azw', 'azw3', 'pobi'}
    commit_name = 'mobi_input'
    def convert(self, stream, options, file_ext, log,
                accelerators):
        self.is_kf8 = False
        self.mobi_is_joint = False
        from calibre.ebooks.mobi.reader.mobi6 import MobiReader
        from lxml import html
        parse_cache = {}
        try:
            mr = MobiReader(stream, log, options.input_encoding,
                        options.debug_pipeline)
            if mr.kf8_type is None:
                mr.extract_content('.', parse_cache)
        except Exception:
            # First parse failed; retry with the extra-data workaround enabled
            mr = MobiReader(stream, log, options.input_encoding,
                        options.debug_pipeline, try_extra_data_fix=True)
            if mr.kf8_type is None:
                mr.extract_content('.', parse_cache)
        if mr.kf8_type is not None:
            log('Found KF8 MOBI of type %r'%mr.kf8_type)
            if mr.kf8_type == 'joint':
                self.mobi_is_joint = True
            from calibre.ebooks.mobi.reader.mobi8 import Mobi8Reader
            mr = Mobi8Reader(mr, log)
            opf = os.path.abspath(mr())
            self.encrypted_fonts = mr.encrypted_fonts
            self.is_kf8 = True
            return opf
        raw = parse_cache.pop('calibre_raw_mobi_markup', False)
        if raw:
            if isinstance(raw, unicode_type):
                raw = raw.encode('utf-8')
            with lopen('debug-raw.html', 'wb') as f:
                f.write(raw)
        from calibre.ebooks.oeb.base import close_self_closing_tags
        for f, root in parse_cache.items():
            raw = html.tostring(root, encoding='utf-8', method='xml',
                    include_meta_content_type=False)
            raw = close_self_closing_tags(raw)
            with lopen(f, 'wb') as q:
                q.write(raw)
        accelerators['pagebreaks'] = '//h:div[@class="mbp_pagebreak"]'
        return mr.created_opf_path
--- a/ebook_converter/ebooks/conversion/plugins/mobi_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/mobi_output.py
@@ -0,0 +1,337 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 from calibre.customize.conversion import (OutputFormatPlugin,
        OptionRecommendation)
 from polyglot.builtins import unicode_type
 def remove_html_cover(oeb, log):
    from calibre.ebooks.oeb.base import OEB_DOCS
    if not oeb.metadata.cover \
        or 'cover' not in oeb.guide:
        return
    href = oeb.guide['cover'].href
    del oeb.guide['cover']
    item = oeb.manifest.hrefs[href]
    if item.spine_position is not None:
        log.warn('Found an HTML cover: ', item.href, 'removing it.',
                'If you find some content missing from the output MOBI, it '
                'is because you misidentified the HTML cover in the input '
                'document')
        oeb.spine.remove(item)
        if item.media_type in OEB_DOCS:
            oeb.manifest.remove(item)
 def extract_mobi(output_path, opts):
    if opts.extract_to is not None:
        from calibre.ebooks.mobi.debug.main import inspect_mobi
        ddir = opts.extract_to
        inspect_mobi(output_path, ddir=ddir)
 class MOBIOutput(OutputFormatPlugin):
    name = 'MOBI Output'
    author = 'Kovid Goyal'
    file_type = 'mobi'
    commit_name = 'mobi_output'
    ui_data = {'file_types': ['old', 'both', 'new']}
    options = {
        OptionRecommendation(name='prefer_author_sort',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('When present, use author sort field as author.')
        ),
        OptionRecommendation(name='no_inline_toc',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Don\'t add Table of Contents to the book. Useful if '
                'the book has its own table of contents.')),
        OptionRecommendation(name='toc_title', recommended_value=None,
            help=_('Title for any generated in-line table of contents.')
        ),
        OptionRecommendation(name='dont_compress',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Disable compression of the file contents.')
        ),
        OptionRecommendation(name='personal_doc', recommended_value='[PDOC]',
            help=_('Tag for MOBI files to be marked as personal documents.'
                   ' This option has no effect on the conversion. It is used'
                   ' only when sending MOBI files to a device. If the file'
                   ' being sent has the specified tag, it will be marked as'
                   ' a personal document when sent to the Kindle.')
        ),
        OptionRecommendation(name='mobi_ignore_margins',
            recommended_value=False,
            help=_('Ignore margins in the input document. If False, then '
                'the MOBI output plugin will try to convert margins specified'
                ' in the input document, otherwise it will ignore them.')
        ),
        OptionRecommendation(name='mobi_toc_at_start',
            recommended_value=False,
            help=_('When adding the Table of Contents to the book, add it at the start of the '
                'book instead of the end. Not recommended.')
        ),
        OptionRecommendation(name='extract_to',
            help=_('Extract the contents of the generated %s file to the '
                'specified directory. The contents of the directory are first '
                'deleted, so be careful.') % 'MOBI'
        ),
        OptionRecommendation(name='share_not_sync', recommended_value=False,
            help=_('Enable sharing of book content via Facebook etc. '
                ' on the Kindle. WARNING: Using this feature means that '
                ' the book will not auto sync its last read position '
                ' on multiple devices. Complain to Amazon.')
        ),
        OptionRecommendation(name='mobi_keep_original_images',
            recommended_value=False,
            help=_('By default calibre converts all images to JPEG format '
                'in the output MOBI file. This is for maximum compatibility '
                'as some older MOBI viewers have problems with other image '
                'formats. This option tells calibre not to do this. '
                'Useful if your document contains lots of GIF/PNG images that '
                'become very large when converted to JPEG.')),
        OptionRecommendation(name='mobi_file_type', choices=ui_data['file_types'], recommended_value='old',
            help=_('By default calibre generates MOBI files that contain the '
                'old MOBI 6 format. This format is compatible with all '
                'devices. However, by changing this setting, you can tell '
                'calibre to generate MOBI files that contain both MOBI 6 and '
                'the new KF8 format, or only the new KF8 format. KF8 has '
                'more features than MOBI 6, but only works with newer Kindles. '
                'Allowed values: {}').format('old, both, new')),
    }
    def check_for_periodical(self):
        if self.is_periodical:
            self.periodicalize_toc()
            self.check_for_masthead()
            self.opts.mobi_periodical = True
        else:
            self.opts.mobi_periodical = False
    def check_for_masthead(self):
        found = 'masthead' in self.oeb.guide
        if not found:
            from calibre.ebooks import generate_masthead
            self.oeb.log.debug('No masthead found in manifest, generating default mastheadImage...')
            raw = generate_masthead(unicode_type(self.oeb.metadata['title'][0]))
            id, href = self.oeb.manifest.generate('masthead', 'masthead')
            self.oeb.manifest.add(id, href, 'image/gif', data=raw)
            self.oeb.guide.add('masthead', 'Masthead Image', href)
        else:
            self.oeb.log.debug('Using mastheadImage supplied in manifest...')
    def periodicalize_toc(self):
        from calibre.ebooks.oeb.base import TOC
        toc = self.oeb.toc
        if not toc or len(self.oeb.spine) < 3:
            return
        if toc and toc[0].klass != 'periodical':
            one, two = self.oeb.spine[0], self.oeb.spine[1]
            self.log('Converting TOC for MOBI periodical indexing...')
            articles = {}
            if toc.depth() < 3:
                # single section periodical
                self.oeb.manifest.remove(one)
                self.oeb.manifest.remove(two)
                sections = [TOC(klass='section', title=_('All articles'),
                    href=self.oeb.spine[0].href)]
                for x in toc:
                    sections[0].nodes.append(x)
            else:
                # multi-section periodical
                self.oeb.manifest.remove(one)
                sections = list(toc)
                for i,x in enumerate(sections):
                    x.klass = 'section'
                    articles_ = list(x)
                    if articles_:
                        self.oeb.manifest.remove(self.oeb.manifest.hrefs[x.href])
                        x.href = articles_[0].href
            for sec in sections:
                articles[id(sec)] = []
                for a in list(sec):
                    a.klass = 'article'
                    articles[id(sec)].append(a)
                    sec.nodes.remove(a)
            root = TOC(klass='periodical', href=self.oeb.spine[0].href,
                    title=unicode_type(self.oeb.metadata.title[0]))
            for s in sections:
                if articles[id(s)]:
                    for a in articles[id(s)]:
                        s.nodes.append(a)
                    root.nodes.append(s)
            for x in list(toc.nodes):
                toc.nodes.remove(x)
            toc.nodes.append(root)
            # Fix up the periodical href to point to first section href
            toc.nodes[0].href = toc.nodes[0].nodes[0].href
    def convert(self, oeb, output_path, input_plugin, opts, log):
        from calibre.ebooks.mobi.writer2.resources import Resources
        self.log, self.opts, self.oeb = log, opts, oeb
        mobi_type = opts.mobi_file_type
        if self.is_periodical:
            mobi_type = 'old'  # Amazon does not support KF8 periodicals
        create_kf8 = mobi_type in ('new', 'both')
        remove_html_cover(self.oeb, self.log)
        resources = Resources(oeb, opts, self.is_periodical,
                add_fonts=create_kf8)
        self.check_for_periodical()
        if create_kf8:
            from calibre.ebooks.mobi.writer8.cleanup import remove_duplicate_anchors
            remove_duplicate_anchors(self.oeb)
            # Split on pagebreaks so that the resulting KF8 is faster to load
            from calibre.ebooks.oeb.transforms.split import Split
            Split()(self.oeb, self.opts)
        kf8 = self.create_kf8(resources, for_joint=mobi_type=='both'
                ) if create_kf8 else None
        if mobi_type == 'new':
            kf8.write(output_path)
            extract_mobi(output_path, opts)
            return
        self.log('Creating MOBI 6 output')
        self.write_mobi(input_plugin, output_path, kf8, resources)
    def create_kf8(self, resources, for_joint=False):
        from calibre.ebooks.mobi.writer8.main import create_kf8_book
        return create_kf8_book(self.oeb, self.opts, resources,
                for_joint=for_joint)
    def write_mobi(self, input_plugin, output_path, kf8, resources):
        from calibre.ebooks.mobi.mobiml import MobiMLizer
        from calibre.ebooks.oeb.transforms.manglecase import CaseMangler
        from calibre.ebooks.oeb.transforms.rasterize import SVGRasterizer, Unavailable
        from calibre.ebooks.oeb.transforms.htmltoc import HTMLTOCAdder
        from calibre.customize.ui import plugin_for_input_format
        opts, oeb = self.opts, self.oeb
        if not opts.no_inline_toc:
            tocadder = HTMLTOCAdder(title=opts.toc_title, position='start' if
                    opts.mobi_toc_at_start else 'end')
            tocadder(oeb, opts)
        mangler = CaseMangler()
        mangler(oeb, opts)
        try:
            rasterizer = SVGRasterizer()
            rasterizer(oeb, opts)
        except Unavailable:
            self.log.warn('SVG rasterizer unavailable, SVG will not be converted')
        else:
            # Add rasterized SVG images
            resources.add_extra_images()
        if hasattr(self.oeb, 'inserted_metadata_jacket'):
            self.workaround_fire_bugs(self.oeb.inserted_metadata_jacket)
        mobimlizer = MobiMLizer(ignore_tables=opts.linearize_tables)
        mobimlizer(oeb, opts)
        write_page_breaks_after_item = input_plugin is not plugin_for_input_format('cbz')
        from calibre.ebooks.mobi.writer2.main import MobiWriter
        writer = MobiWriter(opts, resources, kf8,
                        write_page_breaks_after_item=write_page_breaks_after_item)
        writer(oeb, output_path)
        extract_mobi(output_path, opts)
    def specialize_css_for_output(self, log, opts, item, stylizer):
        from calibre.ebooks.mobi.writer8.cleanup import CSSCleanup
        CSSCleanup(log, opts)(item, stylizer)
    def workaround_fire_bugs(self, jacket):
        # The idiotic Fire crashes when trying to render the table used to
        # layout the jacket
        from calibre.ebooks.oeb.base import XHTML
        for table in jacket.data.xpath('//*[local-name()="table"]'):
            table.tag = XHTML('div')
            for tr in table.xpath('descendant::*[local-name()="tr"]'):
                cols = tr.xpath('descendant::*[local-name()="td"]')
                tr.tag = XHTML('div')
                for td in cols:
                    td.tag = XHTML('span' if cols else 'div')
 class AZW3Output(OutputFormatPlugin):
    name = 'AZW3 Output'
    author = 'Kovid Goyal'
    file_type = 'azw3'
    commit_name = 'azw3_output'
    options = {
        OptionRecommendation(name='prefer_author_sort',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('When present, use author sort field as author.')
        ),
        OptionRecommendation(name='no_inline_toc',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Don\'t add Table of Contents to the book. Useful if '
                'the book has its own table of contents.')),
        OptionRecommendation(name='toc_title', recommended_value=None,
            help=_('Title for any generated in-line table of contents.')
        ),
        OptionRecommendation(name='dont_compress',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Disable compression of the file contents.')
        ),
        OptionRecommendation(name='mobi_toc_at_start',
            recommended_value=False,
            help=_('When adding the Table of Contents to the book, add it at the start of the '
                'book instead of the end. Not recommended.')
        ),
        OptionRecommendation(name='extract_to',
            help=_('Extract the contents of the generated %s file to the '
                'specified directory. The contents of the directory are first '
                'deleted, so be careful.') % 'AZW3'),
        OptionRecommendation(name='share_not_sync', recommended_value=False,
            help=_('Enable sharing of book content via Facebook etc. '
                ' on the Kindle. WARNING: Using this feature means that '
                ' the book will not auto sync its last read position '
                ' on multiple devices. Complain to Amazon.')
        ),
    }
    def convert(self, oeb, output_path, input_plugin, opts, log):
        from calibre.ebooks.mobi.writer2.resources import Resources
        from calibre.ebooks.mobi.writer8.main import create_kf8_book
        from calibre.ebooks.mobi.writer8.cleanup import remove_duplicate_anchors
        self.oeb, self.opts, self.log = oeb, opts, log
        opts.mobi_periodical = self.is_periodical
        passthrough = getattr(opts, 'mobi_passthrough', False)
        remove_duplicate_anchors(oeb)
        resources = Resources(self.oeb, self.opts, self.is_periodical,
                add_fonts=True, process_images=False)
        if not passthrough:
            remove_html_cover(self.oeb, self.log)
            # Split on pagebreaks so that the resulting KF8 is faster to load
            from calibre.ebooks.oeb.transforms.split import Split
            Split()(self.oeb, self.opts)
        kf8 = create_kf8_book(self.oeb, self.opts, resources, for_joint=False)
        kf8.write(output_path)
        extract_mobi(output_path, opts)
    def specialize_css_for_output(self, log, opts, item, stylizer):
        from calibre.ebooks.mobi.writer8.cleanup import CSSCleanup
        CSSCleanup(log, opts)(item, stylizer)
--- a/ebook_converter/ebooks/conversion/plugins/odt_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/odt_input.py
@@ -0,0 +1,25 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
 __docformat__ = 'restructuredtext en'
 '''
 Convert an ODT file into an Open Ebook
 '''
 from calibre.customize.conversion import InputFormatPlugin
 class ODTInput(InputFormatPlugin):
    name        = 'ODT Input'
    author      = 'Kovid Goyal'
    description = 'Convert ODT (OpenOffice) files to HTML'
    file_types  = {'odt'}
    commit_name = 'odt_input'
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.odt.input import Extract
        return Extract()(stream, '.', log)
--- a/ebook_converter/ebooks/conversion/plugins/oeb_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/oeb_output.py
@@ -0,0 +1,122 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import os, re
 from calibre.customize.conversion import (OutputFormatPlugin,
        OptionRecommendation)
 from calibre import CurrentDir
 class OEBOutput(OutputFormatPlugin):
    name = 'OEB Output'
    author = 'Kovid Goyal'
    file_type = 'oeb'
    commit_name = 'oeb_output'
    recommendations = {('pretty_print', True, OptionRecommendation.HIGH)}
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from polyglot.urllib import unquote
        from lxml import etree
        self.log, self.opts = log, opts
        if not os.path.exists(output_path):
            os.makedirs(output_path)
        from calibre.ebooks.oeb.base import OPF_MIME, NCX_MIME, PAGE_MAP_MIME, OEB_STYLES
        from calibre.ebooks.oeb.normalize_css import condense_sheet
        with CurrentDir(output_path):
            results = oeb_book.to_opf2(page_map=True)
            for key in (OPF_MIME, NCX_MIME, PAGE_MAP_MIME):
                href, root = results.pop(key, [None, None])
                if root is not None:
                    if key == OPF_MIME:
                        try:
                            self.workaround_nook_cover_bug(root)
                        except Exception:
                            self.log.exception('Something went wrong while trying to'
                                    ' work around Nook cover bug, ignoring')
                        try:
                            self.workaround_pocketbook_cover_bug(root)
                        except Exception:
                            self.log.exception('Something went wrong while trying to'
                                    ' work around Pocketbook cover bug, ignoring')
                        self.migrate_lang_code(root)
                    raw = etree.tostring(root, pretty_print=True,
                            encoding='utf-8', xml_declaration=True)
                    if key == OPF_MIME:
                        # Strip the opf: prefix from element tags (for example
                        # <opf:metadata> becomes <metadata>). Needed as I can't
                        # get lxml to output opf:role and not output
                        # <opf:metadata> as well
                        raw = re.sub(br'(<[/]{0,1})opf:', br'\1', raw)
                    with lopen(href, 'wb') as f:
                        f.write(raw)
            for item in oeb_book.manifest:
                if (
                        not self.opts.expand_css and item.media_type in OEB_STYLES and hasattr(
                            item.data, 'cssText') and 'nook' not in self.opts.output_profile.short_name):
                    condense_sheet(item.data)
                path = os.path.abspath(unquote(item.href))
                dir = os.path.dirname(path)
                if not os.path.exists(dir):
                    os.makedirs(dir)
                with lopen(path, 'wb') as f:
                    f.write(item.bytes_representation)
                item.unload_data_from_memory(memory=path)
    def workaround_nook_cover_bug(self, root):  # {{{
        cov = root.xpath('//*[local-name() = "meta" and @name="cover" and'
                ' @content != "cover"]')
        def manifest_items_with_id(id_):
            return root.xpath('//*[local-name() = "manifest"]/*[local-name() = "item" '
                ' and @id="%s"]'%id_)
        if len(cov) == 1:
            cov = cov[0]
            covid = cov.get('content', '')
            if covid:
                manifest_item = manifest_items_with_id(covid)
                if len(manifest_item) == 1 and \
                        manifest_item[0].get('media-type',
                                '').startswith('image/'):
                    self.log.warn('The cover image has an id != "cover". Renaming'
                            ' to work around bug in Nook Color')
                    from calibre.ebooks.oeb.base import uuid_id
                    newid = uuid_id()
                    for item in manifest_items_with_id('cover'):
                        item.set('id', newid)
                    for x in root.xpath('//*[@idref="cover"]'):
                        x.set('idref', newid)
                    manifest_item = manifest_item[0]
                    manifest_item.set('id', 'cover')
                    cov.set('content', 'cover')
    # }}}
    def workaround_pocketbook_cover_bug(self, root):  # {{{
        m = root.xpath('//*[local-name() = "manifest"]/*[local-name() = "item" '
                ' and @id="cover"]')
        if len(m) == 1:
            m = m[0]
            p = m.getparent()
            p.remove(m)
            p.insert(0, m)
    # }}}
    def migrate_lang_code(self, root):  # {{{
        from calibre.utils.localization import lang_as_iso639_1
        for lang in root.xpath('//*[local-name() = "language"]'):
            clc = lang_as_iso639_1(lang.text)
            if clc:
                lang.text = clc
    # }}}
--- a/ebook_converter/ebooks/conversion/plugins/pdb_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/pdb_input.py
@@ -0,0 +1,37 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 from calibre.customize.conversion import InputFormatPlugin
 from polyglot.builtins import getcwd
 class PDBInput(InputFormatPlugin):
    name        = 'PDB Input'
    author      = 'John Schember'
    description = 'Convert PDB to HTML'
    file_types  = {'pdb', 'updb'}
    commit_name = 'pdb_input'
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.pdb.header import PdbHeaderReader
        from calibre.ebooks.pdb import PDBError, IDENTITY_TO_NAME, get_reader
        header = PdbHeaderReader(stream)
        Reader = get_reader(header.ident)
        if Reader is None:
            raise PDBError('No reader available for format within container.\n Identity is %s. Book type is %s' %
                           (header.ident, IDENTITY_TO_NAME.get(header.ident, _('Unknown'))))
        log.debug('Detected ebook format as: %s with identity: %s' % (IDENTITY_TO_NAME[header.ident], header.ident))
        reader = Reader(header, stream, log, options)
        opf = reader.extract_content(getcwd())
        return opf
--- a/ebook_converter/ebooks/conversion/plugins/pdb_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/pdb_output.py
@@ -0,0 +1,64 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import OutputFormatPlugin, \
    OptionRecommendation
 from calibre.ebooks.pdb import PDBError, get_writer, ALL_FORMAT_WRITERS
 class PDBOutput(OutputFormatPlugin):
    name = 'PDB Output'
    author = 'John Schember'
    file_type = 'pdb'
    commit_name = 'pdb_output'
    ui_data = {'formats': tuple(ALL_FORMAT_WRITERS)}
    options = {
        OptionRecommendation(name='format', recommended_value='doc',
            level=OptionRecommendation.LOW,
            short_switch='f', choices=list(ALL_FORMAT_WRITERS),
            help=(_('Format to use inside the pdb container. Choices are:') + ' %s' % sorted(ALL_FORMAT_WRITERS))),
        OptionRecommendation(name='pdb_output_encoding', recommended_value='cp1252',
            level=OptionRecommendation.LOW,
            help=_('Specify the character encoding of the output document. '
            'The default is cp1252. Note: This option is not honored by all '
            'formats.')),
        OptionRecommendation(name='inline_toc',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Add Table of Contents to beginning of the book.')),
    }
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        close = False
        if not hasattr(output_path, 'write'):
            close = True
            if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path):
                os.makedirs(os.path.dirname(output_path))
            out_stream = lopen(output_path, 'wb')
        else:
            out_stream = output_path
        Writer = get_writer(opts.format)
        if Writer is None:
            raise PDBError('No writer available for format %s.' % opts.format)
        setattr(opts, 'max_line_length', 0)
        setattr(opts, 'force_max_line_length', False)
        writer = Writer(opts, log)
        out_stream.seek(0)
        out_stream.truncate()
        writer.write_content(oeb_book, out_stream, oeb_book.metadata)
        if close:
            out_stream.close()
--- a/ebook_converter/ebooks/conversion/plugins/pdf_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/pdf_input.py
@@ -0,0 +1,82 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
 from polyglot.builtins import as_bytes, getcwd
 class PDFInput(InputFormatPlugin):
    name        = 'PDF Input'
    author      = 'Kovid Goyal and John Schember'
    description = 'Convert PDF files to HTML'
    file_types  = {'pdf'}
    commit_name = 'pdf_input'
    options = {
        OptionRecommendation(name='no_images', recommended_value=False,
            help=_('Do not extract images from the document')),
        OptionRecommendation(name='unwrap_factor', recommended_value=0.45,
            help=_('Scale used to determine the length at which a line should '
            'be unwrapped. Valid values are a decimal between 0 and 1. The '
            'default is 0.45, just below the median line length.')),
        OptionRecommendation(name='new_pdf_engine', recommended_value=False,
            help=_('Use the new PDF conversion engine. Currently not operational.'))
    }
    def convert_new(self, stream, accelerators):
        from calibre.ebooks.pdf.pdftohtml import pdftohtml
        from calibre.utils.cleantext import clean_ascii_chars
        from calibre.ebooks.pdf.reflow import PDFDocument
        pdftohtml(getcwd(), stream.name, self.opts.no_images, as_xml=True)
        with lopen('index.xml', 'rb') as f:
            xml = clean_ascii_chars(f.read())
        PDFDocument(xml, self.opts, self.log)
        return os.path.join(getcwd(), 'metadata.opf')
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.metadata.opf2 import OPFCreator
        from calibre.ebooks.pdf.pdftohtml import pdftohtml
        log.debug('Converting file to html...')
        # The main html file will be named index.html
        self.opts, self.log = options, log
        if options.new_pdf_engine:
            return self.convert_new(stream, accelerators)
        pdftohtml(getcwd(), stream.name, options.no_images)
        from calibre.ebooks.metadata.meta import get_metadata
        log.debug('Retrieving document metadata...')
        mi = get_metadata(stream, 'pdf')
        opf = OPFCreator(getcwd(), mi)
        manifest = [('index.html', None)]
        images = os.listdir(getcwd())
        images.remove('index.html')
        for i in images:
            manifest.append((i, None))
        log.debug('Generating manifest...')
        opf.create_manifest(manifest)
        opf.create_spine(['index.html'])
        log.debug('Rendering manifest...')
        with lopen('metadata.opf', 'wb') as opffile:
            opf.render(opffile)
        if os.path.exists('toc.ncx'):
            ncxid = opf.manifest.id_for_path('toc.ncx')
            if ncxid:
                with lopen('metadata.opf', 'r+b') as f:
                    raw = f.read().replace(b'<spine', b'<spine toc="%s"' % as_bytes(ncxid))
                    f.seek(0)
                    f.write(raw)
        return os.path.join(getcwd(), 'metadata.opf')
--- a/ebook_converter/ebooks/conversion/plugins/pdf_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/pdf_output.py
@@ -0,0 +1,256 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2012, Kovid Goyal <kovid at kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 '''
 Convert OEB ebook format to PDF.
 '''
 import glob, os
 from calibre.customize.conversion import (OutputFormatPlugin,
    OptionRecommendation)
 from calibre.ptempfile import TemporaryDirectory
 from polyglot.builtins import iteritems, unicode_type
 UNITS = ('millimeter', 'centimeter', 'point', 'inch' , 'pica' , 'didot',
        'cicero', 'devicepixel')
 PAPER_SIZES = ('a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'b0', 'b1',
        'b2', 'b3', 'b4', 'b5', 'b6', 'legal', 'letter')
 class PDFOutput(OutputFormatPlugin):
    name = 'PDF Output'
    author = 'Kovid Goyal'
    file_type = 'pdf'
    commit_name = 'pdf_output'
    ui_data = {'paper_sizes': PAPER_SIZES, 'units': UNITS, 'font_types': ('serif', 'sans', 'mono')}
    options = {
        OptionRecommendation(name='use_profile_size', recommended_value=False,
            help=_('Instead of using the paper size specified in the PDF Output options,'
                   ' use a paper size corresponding to the current output profile.'
                   ' Useful if you want to generate a PDF for viewing on a specific device.')),
        OptionRecommendation(name='unit', recommended_value='inch',
            level=OptionRecommendation.LOW, short_switch='u', choices=UNITS,
            help=_('The unit of measure for page sizes. Default is inch. Choices '
            'are {}. '
            'Note: This does not override the unit for margins!').format(', '.join(UNITS))),
        OptionRecommendation(name='paper_size', recommended_value='letter',
            level=OptionRecommendation.LOW, choices=PAPER_SIZES,
            help=_('The size of the paper. This size will be overridden when a '
            'non default output profile is used. Default is letter. Choices '
            'are {}').format(', '.join(PAPER_SIZES))),
        OptionRecommendation(name='custom_size', recommended_value=None,
            help=_('Custom size of the document. Use the form widthxheight '
            'e.g. `123x321` to specify the width and height. '
            'This overrides any specified paper-size.')),
        OptionRecommendation(name='preserve_cover_aspect_ratio',
            recommended_value=False,
            help=_('Preserve the aspect ratio of the cover, instead'
                ' of stretching it to fill the full first page of the'
                ' generated pdf.')),
        OptionRecommendation(name='pdf_serif_family',
            recommended_value='Times', help=_(
                'The font family used to render serif fonts. Will work only if the font is available system-wide.')),
        OptionRecommendation(name='pdf_sans_family',
            recommended_value='Helvetica', help=_(
                'The font family used to render sans-serif fonts. Will work only if the font is available system-wide.')),
        OptionRecommendation(name='pdf_mono_family',
            recommended_value='Courier', help=_(
                'The font family used to render monospace fonts. Will work only if the font is available system-wide.')),
        OptionRecommendation(name='pdf_standard_font', choices=ui_data['font_types'],
            recommended_value='serif', help=_(
                'The font type (serif, sans or mono) used as the standard font')),
        OptionRecommendation(name='pdf_default_font_size',
            recommended_value=20, help=_(
                'The default font size')),
        OptionRecommendation(name='pdf_mono_font_size',
            recommended_value=16, help=_(
                'The default font size for monospaced text')),
        OptionRecommendation(name='pdf_hyphenate', recommended_value=False,
            help=_('Break long words at the end of lines. This can give the text at the right margin a more even appearance.')),
        OptionRecommendation(name='pdf_mark_links', recommended_value=False,
            help=_('Surround all links with a red box, useful for debugging.')),
        OptionRecommendation(name='pdf_page_numbers', recommended_value=False,
            help=_('Add page numbers to the bottom of every page in the generated PDF file. If you '
                   'specify a footer template, it will take precedence '
                   'over this option.')),
        OptionRecommendation(name='pdf_footer_template', recommended_value=None,
            help=_('An HTML template used to generate %s on every page.'
                   ' The strings _PAGENUM_, _TITLE_, _AUTHOR_ and _SECTION_ will be replaced by their current values.')%_('footers')),
        OptionRecommendation(name='pdf_header_template', recommended_value=None,
            help=_('An HTML template used to generate %s on every page.'
                   ' The strings _PAGENUM_, _TITLE_, _AUTHOR_ and _SECTION_ will be replaced by their current values.')%_('headers')),
        OptionRecommendation(name='pdf_add_toc', recommended_value=False,
            help=_('Add a Table of Contents at the end of the PDF that lists page numbers. '
                   'Useful if you want to print out the PDF. If this PDF is intended for electronic use, use the PDF Outline instead.')),
        OptionRecommendation(name='toc_title', recommended_value=None,
            help=_('Title for generated table of contents.')
        ),
        OptionRecommendation(name='pdf_page_margin_left', recommended_value=72.0,
            level=OptionRecommendation.LOW,
            help=_('The size of the left page margin, in pts. Default is 72pt.'
                   ' Overrides the common left page margin setting.')
        ),
        OptionRecommendation(name='pdf_page_margin_top', recommended_value=72.0,
            level=OptionRecommendation.LOW,
            help=_('The size of the top page margin, in pts. Default is 72pt.'
                   ' Overrides the common top page margin setting, unless set to zero.')
        ),
        OptionRecommendation(name='pdf_page_margin_right', recommended_value=72.0,
            level=OptionRecommendation.LOW,
            help=_('The size of the right page margin, in pts. Default is 72pt.'
                   ' Overrides the common right page margin setting, unless set to zero.')
        ),
        OptionRecommendation(name='pdf_page_margin_bottom', recommended_value=72.0,
            level=OptionRecommendation.LOW,
            help=_('The size of the bottom page margin, in pts. Default is 72pt.'
                   ' Overrides the common bottom page margin setting, unless set to zero.')
        ),
        OptionRecommendation(name='pdf_use_document_margins', recommended_value=False,
            help=_('Use the page margins specified in the input document via @page CSS rules.'
            ' This will cause the margins specified in the conversion settings to be ignored.'
            ' If the document does not specify page margins, the conversion settings will be used as a fallback.')
        ),
        OptionRecommendation(name='pdf_page_number_map', recommended_value=None,
            help=_('Adjust page numbers, as needed. Syntax is a JavaScript expression for the page number.'
                ' For example, "if (n < 3) 0; else n - 3;", where n is current page number.')
        ),
        OptionRecommendation(name='uncompressed_pdf',
            recommended_value=False, help=_(
                'Generate an uncompressed PDF, useful for debugging.')
        ),
        OptionRecommendation(name='pdf_odd_even_offset', recommended_value=0.0,
            level=OptionRecommendation.LOW,
            help=_(
                'Shift the text horizontally by the specified offset (in pts).'
                ' On odd numbered pages, it is shifted to the right and on even'
                ' numbered pages to the left. Use negative numbers for the opposite'
                ' effect. Note that this setting is ignored on pages where the margins'
                ' are smaller than the specified offset. Shifting is done by setting'
                ' the PDF CropBox; note that not all software respects the CropBox.'
            )
        ),
    }
    def specialize_options(self, log, opts, input_fmt):
        # Ensure Qt is set up for use with WebEngine.
        # specialize_options is called early enough in the pipeline
        # that, hopefully, no Qt application has been constructed yet.
        from PyQt5.QtWebEngineCore import QWebEngineUrlScheme
        from PyQt5.QtWebEngineWidgets import QWebEnginePage  # noqa
        from calibre.gui2 import must_use_qt
        from calibre.constants import FAKE_PROTOCOL
        scheme = QWebEngineUrlScheme(FAKE_PROTOCOL.encode('ascii'))
        scheme.setSyntax(QWebEngineUrlScheme.Syntax.Host)
        scheme.setFlags(QWebEngineUrlScheme.SecureScheme)
        QWebEngineUrlScheme.registerScheme(scheme)
        must_use_qt()
        self.input_fmt = input_fmt
        if opts.pdf_use_document_margins:
            # Prevent the conversion pipeline from overwriting document margins
            opts.margin_left = opts.margin_right = opts.margin_top = opts.margin_bottom = -1
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        self.stored_page_margins = getattr(opts, '_stored_page_margins', {})
        self.oeb = oeb_book
        self.input_plugin, self.opts, self.log = input_plugin, opts, log
        self.output_path = output_path
        from calibre.ebooks.oeb.base import OPF, OPF2_NS
        from lxml import etree
        from io import BytesIO
        package = etree.Element(OPF('package'),
            attrib={'version': '2.0', 'unique-identifier': 'dummy'},
            nsmap={None: OPF2_NS})
        from calibre.ebooks.metadata.opf2 import OPF
        self.oeb.metadata.to_opf2(package)
        self.metadata = OPF(BytesIO(etree.tostring(package))).to_book_metadata()
        self.cover_data = None
        if input_plugin.is_image_collection:
            log.debug('Converting input as an image collection...')
            self.convert_images(input_plugin.get_images())
        else:
            log.debug('Converting input as a text based book...')
            self.convert_text(oeb_book)
    def convert_images(self, images):
        from calibre.ebooks.pdf.image_writer import convert
        convert(images, self.output_path, self.opts, self.metadata, self.report_progress)
    def get_cover_data(self):
        oeb = self.oeb
        if (oeb.metadata.cover and unicode_type(oeb.metadata.cover[0]) in oeb.manifest.ids):
            cover_id = unicode_type(oeb.metadata.cover[0])
            item = oeb.manifest.ids[cover_id]
            self.cover_data = item.data
    def process_fonts(self):
        ''' Make sure all fonts are embeddable '''
        from calibre.ebooks.oeb.base import urlnormalize
        from calibre.utils.fonts.utils import remove_embed_restriction
        processed = set()
        for item in list(self.oeb.manifest):
            if not hasattr(item.data, 'cssRules'):
                continue
            for i, rule in enumerate(item.data.cssRules):
                if rule.type == rule.FONT_FACE_RULE:
                    try:
                        s = rule.style
                        src = s.getProperty('src').propertyValue[0].uri
                    except Exception:
                        continue
                    path = item.abshref(src)
                    ff = self.oeb.manifest.hrefs.get(urlnormalize(path), None)
                    if ff is None:
                        continue
                    raw = nraw = ff.data
                    if path not in processed:
                        processed.add(path)
                        try:
                            nraw = remove_embed_restriction(raw)
                        except Exception:
                            continue
                        if nraw != raw:
                            ff.data = nraw
                            self.oeb.container.write(path, nraw)
    def convert_text(self, oeb_book):
        import json
        from calibre.ebooks.pdf.html_writer import convert
        self.get_cover_data()
        self.process_fonts()
        if self.opts.pdf_use_document_margins and self.stored_page_margins:
            for href, margins in iteritems(self.stored_page_margins):
                item = oeb_book.manifest.hrefs.get(href)
                if item is not None:
                    root = item.data
                    if hasattr(root, 'xpath') and margins:
                        root.set('data-calibre-pdf-output-page-margins', json.dumps(margins))
        with TemporaryDirectory('_pdf_out') as oeb_dir:
            from calibre.customize.ui import plugin_for_output_format
            oeb_dir = os.path.realpath(oeb_dir)
            oeb_output = plugin_for_output_format('oeb')
            oeb_output.convert(oeb_book, oeb_dir, self.input_plugin, self.opts, self.log)
            opfpath = glob.glob(os.path.join(oeb_dir, '*.opf'))[0]
            convert(
                opfpath, self.opts, metadata=self.metadata, output_path=self.output_path,
                log=self.log, cover_data=self.cover_data, report_progress=self.report_progress
            )
--- a/ebook_converter/ebooks/conversion/plugins/pml_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/pml_input.py
@@ -0,0 +1,165 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import glob
 import os
 import shutil
 from calibre.customize.conversion import InputFormatPlugin
 from calibre.ptempfile import TemporaryDirectory
 from polyglot.builtins import getcwd
 class PMLInput(InputFormatPlugin):
    name        = 'PML Input'
    author      = 'John Schember'
    description = 'Convert PML to OEB'
    # pmlz is a zip file containing pml files and png images.
    file_types  = {'pml', 'pmlz'}
    commit_name = 'pml_input'
    def process_pml(self, pml_path, html_path, close_all=False):
        from calibre.ebooks.pml.pmlconverter import PML_HTMLizer
        pclose = False
        hclose = False
        if not hasattr(pml_path, 'read'):
            pml_stream = lopen(pml_path, 'rb')
            pclose = True
        else:
            pml_stream = pml_path
            pml_stream.seek(0)
        if not hasattr(html_path, 'write'):
            html_stream = lopen(html_path, 'wb')
            hclose = True
        else:
            html_stream = html_path
        ienc = getattr(pml_stream, 'encoding', None)
        if ienc is None:
            ienc = 'cp1252'
        if self.options.input_encoding:
            ienc = self.options.input_encoding
        self.log.debug('Converting PML to HTML...')
        hizer = PML_HTMLizer()
        html = hizer.parse_pml(pml_stream.read().decode(ienc), html_path)
        html = '<html><head><title></title></head><body>%s</body></html>'%html
        html_stream.write(html.encode('utf-8', 'replace'))
        if pclose:
            pml_stream.close()
        if hclose:
            html_stream.close()
        return hizer.get_toc()
    def get_images(self, stream, tdir, top_level=False):
        images = []
        imgs = []
        if top_level:
            imgs = glob.glob(os.path.join(tdir, '*.png'))
        # If no images were found at the top level, try the bookname_img
        # directory, because that's where Dropbook expects them.
        if not imgs:
            if hasattr(stream, 'name'):
                imgs = glob.glob(os.path.join(tdir, os.path.splitext(os.path.basename(stream.name))[0] + '_img', '*.png'))
        # If the Dropbook location has no images, try the generic images directory.
        if not imgs:
            imgs = glob.glob(os.path.join(os.path.join(tdir, 'images'), '*.png'))
        if imgs:
            os.makedirs(os.path.join(getcwd(), 'images'))
        for img in imgs:
            pimg_name = os.path.basename(img)
            pimg_path = os.path.join(getcwd(), 'images', pimg_name)
            images.append('images/' + pimg_name)
            shutil.copy(img, pimg_path)
        return images
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.metadata.toc import TOC
        from calibre.ebooks.metadata.opf2 import OPFCreator
        from calibre.utils.zipfile import ZipFile
        self.options = options
        self.log = log
        pages, images = [], []
        toc = TOC()
        if file_ext == 'pmlz':
            log.debug('De-compressing content to temporary directory...')
            with TemporaryDirectory('_unpmlz') as tdir:
                zf = ZipFile(stream)
                zf.extractall(tdir)
                pmls = glob.glob(os.path.join(tdir, '*.pml'))
                for pml in pmls:
                    html_name = os.path.splitext(os.path.basename(pml))[0]+'.html'
                    html_path = os.path.join(getcwd(), html_name)
                    pages.append(html_name)
                    log.debug('Processing PML item %s...' % pml)
                    ttoc = self.process_pml(pml, html_path)
                    toc += ttoc
                images = self.get_images(stream, tdir, True)
        else:
            toc = self.process_pml(stream, 'index.html')
            pages.append('index.html')
            if hasattr(stream, 'name'):
                images = self.get_images(stream, os.path.abspath(os.path.dirname(stream.name)))
        # We want pages to be ordered alphabetically.
        pages.sort()
        manifest_items = []
        for item in pages+images:
            manifest_items.append((item, None))
        from calibre.ebooks.metadata.meta import get_metadata
        log.debug('Reading metadata from input file...')
        mi = get_metadata(stream, 'pml')
        if 'images/cover.png' in images:
            mi.cover = 'images/cover.png'
        opf = OPFCreator(getcwd(), mi)
        log.debug('Generating manifest...')
        opf.create_manifest(manifest_items)
        opf.create_spine(pages)
        opf.set_toc(toc)
        with lopen('metadata.opf', 'wb') as opffile:
            with lopen('toc.ncx', 'wb') as tocfile:
                opf.render(opffile, tocfile, 'toc.ncx')
        return os.path.join(getcwd(), 'metadata.opf')
    def postprocess_book(self, oeb, opts, log):
        from calibre.ebooks.oeb.base import XHTML, barename
        for item in oeb.spine:
            if hasattr(item.data, 'xpath'):
                for heading in item.data.iterdescendants(*map(XHTML, 'h1 h2 h3 h4 h5 h6'.split())):
                    if not len(heading):
                        continue
                    span = heading[0]
                    if not heading.text and not span.text and not len(span) and barename(span.tag) == 'span':
                        if not heading.get('id') and span.get('id'):
                            heading.set('id', span.get('id'))
                            heading.text = span.tail
                            heading.remove(span)
                    if len(heading) == 1 and heading[0].get('style') == 'text-align: center; margin: auto;':
                        div = heading[0]
                        if barename(div.tag) == 'div' and not len(div) and not div.get('id') and not heading.get('style'):
                            heading.text = (heading.text or '') + (div.text or '') + (div.tail or '')
                            heading.remove(div)
                            heading.set('style', 'text-align: center')
--- a/ebook_converter/ebooks/conversion/plugins/pml_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/pml_output.py
@@ -0,0 +1,77 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os, io
 from calibre.customize.conversion import (OutputFormatPlugin,
        OptionRecommendation)
 from calibre.ptempfile import TemporaryDirectory
 from polyglot.builtins import unicode_type
 class PMLOutput(OutputFormatPlugin):
    name = 'PML Output'
    author = 'John Schember'
    file_type = 'pmlz'
    commit_name = 'pml_output'
    options = {
        OptionRecommendation(name='pml_output_encoding', recommended_value='cp1252',
            level=OptionRecommendation.LOW,
            help=_('Specify the character encoding of the output document. '
            'The default is cp1252.')),
        OptionRecommendation(name='inline_toc',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Add Table of Contents to beginning of the book.')),
        OptionRecommendation(name='full_image_depth',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Do not reduce the size or bit depth of images. Images '
                   'have their size and depth reduced by default to accommodate '
                   'applications that cannot convert images on their '
                   'own such as Dropbook.')),
    }
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from calibre.ebooks.pml.pmlml import PMLMLizer
        from calibre.utils.zipfile import ZipFile
        with TemporaryDirectory('_pmlz_output') as tdir:
            pmlmlizer = PMLMLizer(log)
            pml = unicode_type(pmlmlizer.extract_content(oeb_book, opts))
            with lopen(os.path.join(tdir, 'index.pml'), 'wb') as out:
                out.write(pml.encode(opts.pml_output_encoding, 'replace'))
            img_path = os.path.join(tdir, 'index_img')
            if not os.path.exists(img_path):
                os.makedirs(img_path)
            self.write_images(oeb_book.manifest, pmlmlizer.image_hrefs, img_path, opts)
            log.debug('Compressing output...')
            pmlz = ZipFile(output_path, 'w')
            pmlz.add_dir(tdir)
            # Close explicitly so the archive is fully written to disk.
            pmlz.close()
    def write_images(self, manifest, image_hrefs, out_dir, opts):
        from PIL import Image
        from calibre.ebooks.oeb.base import OEB_RASTER_IMAGES
        for item in manifest:
            if item.media_type in OEB_RASTER_IMAGES and item.href in image_hrefs:
                if opts.full_image_depth:
                    im = Image.open(io.BytesIO(item.data))
                else:
                    im = Image.open(io.BytesIO(item.data)).convert('P')
                    im.thumbnail((300, 300), Image.LANCZOS)
                data = io.BytesIO()
                im.save(data, 'PNG')
                data = data.getvalue()
                path = os.path.join(out_dir, image_hrefs[item.href])
                with lopen(path, 'wb') as out:
                    out.write(data)
--- a/ebook_converter/ebooks/conversion/plugins/rb_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/rb_input.py
@@ -0,0 +1,28 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 from calibre.customize.conversion import InputFormatPlugin
 from polyglot.builtins import getcwd
 class RBInput(InputFormatPlugin):
    name        = 'RB Input'
    author      = 'John Schember'
    description = 'Convert RB files to HTML'
    file_types  = {'rb'}
    commit_name = 'rb_input'
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.rb.reader import Reader
        reader = Reader(stream, log, options.input_encoding)
        opf = reader.extract_content(getcwd())
        return opf
--- a/ebook_converter/ebooks/conversion/plugins/rb_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/rb_output.py
@@ -0,0 +1,45 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
 class RBOutput(OutputFormatPlugin):
    name = 'RB Output'
    author = 'John Schember'
    file_type = 'rb'
    commit_name = 'rb_output'
    options = {
        OptionRecommendation(name='inline_toc',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Add Table of Contents to beginning of the book.'))}
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from calibre.ebooks.rb.writer import RBWriter
        close = False
        if not hasattr(output_path, 'write'):
            close = True
            if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path):
                os.makedirs(os.path.dirname(output_path))
            out_stream = lopen(output_path, 'wb')
        else:
            out_stream = output_path
        writer = RBWriter(opts, log)
        out_stream.seek(0)
        out_stream.truncate()
        writer.write_content(oeb_book, out_stream, oeb_book.metadata)
        if close:
            out_stream.close()
--- a/ebook_converter/ebooks/conversion/plugins/recipe_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/recipe_input.py
@@ -0,0 +1,169 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
 from calibre.constants import numeric_version
 from calibre import walk
 from polyglot.builtins import unicode_type
 class RecipeDisabled(Exception):
    pass
 class RecipeInput(InputFormatPlugin):
    name        = 'Recipe Input'
    author      = 'Kovid Goyal'
    description = _('Download periodical content from the internet')
    file_types  = {'recipe', 'downloaded_recipe'}
    commit_name = 'recipe_input'
    recommendations = {
        ('chapter', None, OptionRecommendation.HIGH),
        ('dont_split_on_page_breaks', True, OptionRecommendation.HIGH),
        ('use_auto_toc', False, OptionRecommendation.HIGH),
        ('input_encoding', None, OptionRecommendation.HIGH),
        ('input_profile', 'default', OptionRecommendation.HIGH),
        ('page_breaks_before', None, OptionRecommendation.HIGH),
        ('insert_metadata', False, OptionRecommendation.HIGH),
        }
    options = {
        OptionRecommendation(name='test', recommended_value=False,
            help=_(
            'Useful for recipe development. Forces'
            ' max_articles_per_feed to 2 and downloads at most 2 feeds.'
            ' You can change the number of feeds and articles by supplying optional arguments.'
            ' For example: --test 3 1 will download at most 3 feeds and only 1 article per feed.')),
        OptionRecommendation(name='username', recommended_value=None,
            help=_('Username for sites that require a login to access '
                'content.')),
        OptionRecommendation(name='password', recommended_value=None,
            help=_('Password for sites that require a login to access '
                'content.')),
        OptionRecommendation(name='dont_download_recipe',
            recommended_value=False,
            help=_('Do not download latest version of builtin recipes from the calibre server')),
        OptionRecommendation(name='lrf', recommended_value=False,
            help='Optimize fetching for subsequent conversion to LRF.'),
        }
    def convert(self, recipe_or_file, opts, file_ext, log,
            accelerators):
        from calibre.web.feeds.recipes import compile_recipe
        opts.output_profile.flow_size = 0
        if file_ext == 'downloaded_recipe':
            from calibre.utils.zipfile import ZipFile
            zf = ZipFile(recipe_or_file, 'r')
            zf.extractall()
            zf.close()
            with lopen('download.recipe', 'rb') as f:
                self.recipe_source = f.read()
            recipe = compile_recipe(self.recipe_source)
            recipe.needs_subscription = False
            self.recipe_object = recipe(opts, log, self.report_progress)
        else:
            if os.environ.get('CALIBRE_RECIPE_URN'):
                from calibre.web.feeds.recipes.collection import get_custom_recipe, get_builtin_recipe_by_id
                urn = os.environ['CALIBRE_RECIPE_URN']
                log('Downloading recipe urn: ' + urn)
                rtype, recipe_id = urn.partition(':')[::2]
                if not recipe_id:
                    raise ValueError('Invalid recipe urn: ' + urn)
                if rtype == 'custom':
                    self.recipe_source = get_custom_recipe(recipe_id)
                else:
                    self.recipe_source = get_builtin_recipe_by_id(urn, log=log, download_recipe=True)
                if not self.recipe_source:
                    raise ValueError('Could not find recipe with urn: ' + urn)
                if not isinstance(self.recipe_source, bytes):
                    self.recipe_source = self.recipe_source.encode('utf-8')
                recipe = compile_recipe(self.recipe_source)
            elif os.access(recipe_or_file, os.R_OK):
                with lopen(recipe_or_file, 'rb') as f:
                    self.recipe_source = f.read()
                recipe = compile_recipe(self.recipe_source)
                log('Using custom recipe')
            else:
                from calibre.web.feeds.recipes.collection import (
                        get_builtin_recipe_by_title, get_builtin_recipe_titles)
                title = getattr(opts, 'original_recipe_input_arg', recipe_or_file)
                title = os.path.basename(title).rpartition('.')[0]
                titles = frozenset(get_builtin_recipe_titles())
                if title not in titles:
                    title = getattr(opts, 'original_recipe_input_arg', recipe_or_file)
                    title = title.rpartition('.')[0]
                raw = get_builtin_recipe_by_title(title, log=log,
                        download_recipe=not opts.dont_download_recipe)
                builtin = False
                try:
                    recipe = compile_recipe(raw)
                    self.recipe_source = raw
                    if recipe.requires_version > numeric_version:
                        log.warn(
                        'Downloaded recipe needs calibre version at least: %s' %
                        ('.'.join(map(str, recipe.requires_version))))
                        builtin = True
                except Exception:
                    log.exception('Failed to compile downloaded recipe. Falling '
                            'back to builtin one')
                    builtin = True
                if builtin:
                    log('Using bundled builtin recipe')
                    raw = get_builtin_recipe_by_title(title, log=log,
                            download_recipe=False)
                    if raw is None:
                        raise ValueError('Failed to find builtin recipe: '+title)
                    recipe = compile_recipe(raw)
                    self.recipe_source = raw
                else:
                    log('Using downloaded builtin recipe')
            if recipe is None:
                raise ValueError('%r is not a valid recipe file or builtin recipe' %
                        recipe_or_file)
            disabled = getattr(recipe, 'recipe_disabled', None)
            if disabled is not None:
                raise RecipeDisabled(disabled)
            ro = recipe(opts, log, self.report_progress)
            ro.download()
            self.recipe_object = ro
        for key, val in self.recipe_object.conversion_options.items():
            setattr(opts, key, val)
        for f in os.listdir('.'):
            if f.endswith('.opf'):
                return os.path.abspath(f)
        for f in walk('.'):
            if f.endswith('.opf'):
                return os.path.abspath(f)
    def postprocess_book(self, oeb, opts, log):
        if self.recipe_object is not None:
            self.recipe_object.internal_postprocess_book(oeb, opts, log)
            self.recipe_object.postprocess_book(oeb, opts, log)
    def specialize(self, oeb, opts, log, output_fmt):
        if opts.no_inline_navbars:
            from calibre.ebooks.oeb.base import XPath
            for item in oeb.spine:
                for div in XPath('//h:div[contains(@class, "calibre_navbar")]')(item.data):
                    div.getparent().remove(div)
    def save_download(self, zf):
        raw = self.recipe_source
        if isinstance(raw, unicode_type):
            raw = raw.encode('utf-8')
        zf.writestr('download.recipe', raw)
--- a/ebook_converter/ebooks/conversion/plugins/rtf_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/rtf_input.py
@@ -0,0 +1,323 @@
 from __future__ import with_statement, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 import os, glob, re, textwrap
 from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
 from polyglot.builtins import iteritems, filter, getcwd, as_bytes
 border_style_map = {
        'single' : 'solid',
        'double-thickness-border' : 'double',
        'shadowed-border': 'outset',
        'double-border': 'double',
        'dotted-border': 'dotted',
        'dashed': 'dashed',
        'hairline': 'solid',
        'inset': 'inset',
        'dash-small': 'dashed',
        'dot-dash': 'dotted',
        'dot-dot-dash': 'dotted',
        'outset': 'outset',
        'tripple': 'double',
        'triple': 'double',
        'thick-thin-small': 'solid',
        'thin-thick-small': 'solid',
        'thin-thick-thin-small': 'solid',
        'thick-thin-medium': 'solid',
        'thin-thick-medium': 'solid',
        'thin-thick-thin-medium': 'solid',
        'thick-thin-large': 'solid',
        'thin-thick-thin-large': 'solid',
        'wavy': 'ridge',
        'double-wavy': 'ridge',
        'striped': 'ridge',
        'emboss': 'inset',
        'engrave': 'inset',
        'frame': 'ridge',
 }
 class RTFInput(InputFormatPlugin):
    name        = 'RTF Input'
    author      = 'Kovid Goyal'
    description = 'Convert RTF files to HTML'
    file_types  = {'rtf'}
    commit_name = 'rtf_input'
    options = {
        OptionRecommendation(name='ignore_wmf', recommended_value=False,
            help=_('Ignore WMF images instead of replacing them with a placeholder image.')),
    }
    def generate_xml(self, stream):
        from calibre.ebooks.rtf2xml.ParseRtf import ParseRtf
        ofile = u'dataxml.xml'
        run_lev, debug_dir, indent_out = 1, None, 0
        if getattr(self.opts, 'debug_pipeline', None) is not None:
            try:
                os.mkdir(u'rtfdebug')
                debug_dir = u'rtfdebug'
                run_lev = 4
                indent_out = 1
                self.log('Running RTFParser in debug mode')
            except:
                self.log.warn('Impossible to run RTFParser in debug mode')
        parser = ParseRtf(
            in_file=stream,
            out_file=ofile,
            # Convert symbol fonts to unicode equivalents. Default
            # is 1
            convert_symbol=1,
            # Convert Zapf fonts to unicode equivalents. Default
            # is 1.
            convert_zapf=1,
            # Convert Wingding fonts to unicode equivalents.
            # Default is 1.
            convert_wingdings=1,
            # Convert RTF caps to real caps.
            # Default is 1.
            convert_caps=1,
            # Indent resulting XML.
            # Default is 0 (no indent).
            indent=indent_out,
            # Form lists from RTF. Default is 1.
            form_lists=1,
            # Convert headings to sections. Default is 0.
            headings_to_sections=1,
            # Group paragraphs with the same style name. Default is 1.
            group_styles=1,
            # Group borders. Default is 1.
            group_borders=1,
            # Write or do not write empty paragraphs. Default is 0.
            empty_paragraphs=1,
            # Debug
            deb_dir=debug_dir,
            # Default encoding
            default_encoding=getattr(self.opts, 'input_encoding', 'cp1252') or 'cp1252',
            # Run level
            run_level=run_lev,
        )
        parser.parse_rtf()
        with open(ofile, 'rb') as f:
            return f.read()
    def extract_images(self, picts):
        from calibre.utils.imghdr import what
        from binascii import unhexlify
        self.log('Extracting images...')
        with open(picts, 'rb') as f:
            raw = f.read()
        picts = filter(len, re.findall(br'\{\\pict([^}]+)\}', raw))
        hex_pat = re.compile(br'[^a-fA-F0-9]')
        encs = [hex_pat.sub(b'', pict) for pict in picts]
        count = 0
        imap = {}
        for enc in encs:
            if len(enc) % 2 == 1:
                enc = enc[:-1]
            data = unhexlify(enc)
            fmt = what(None, data)
            if fmt is None:
                fmt = 'wmf'
            count += 1
            name = u'%04d.%s' % (count, fmt)
            with open(name, 'wb') as f:
                f.write(data)
            imap[count] = name
            # with open(name+'.hex', 'wb') as f:
            #     f.write(enc)
        return self.convert_images(imap)
    def convert_images(self, imap):
        self.default_img = None
        for count, val in iteritems(imap):
            try:
                imap[count] = self.convert_image(val)
            except:
                self.log.exception('Failed to convert', val)
        return imap
    def convert_image(self, name):
        if not name.endswith('.wmf'):
            return name
        try:
            return self.rasterize_wmf(name)
        except Exception:
            self.log.exception('Failed to convert WMF image %r'%name)
        return self.replace_wmf(name)
    def replace_wmf(self, name):
        if self.opts.ignore_wmf:
            os.remove(name)
            return '__REMOVE_ME__'
        from calibre.ebooks.covers import message_image
        if self.default_img is None:
            self.default_img = message_image('Conversion of WMF images is not supported.'
            ' Use Microsoft Word or OpenOffice to save this RTF file'
            ' as HTML and convert that in calibre.')
        name = name.replace('.wmf', '.jpg')
        with lopen(name, 'wb') as f:
            f.write(self.default_img)
        return name
    def rasterize_wmf(self, name):
        from calibre.utils.wmf.parse import wmf_unwrap
        with open(name, 'rb') as f:
            data = f.read()
        data = wmf_unwrap(data)
        name = name.replace('.wmf', '.png')
        with open(name, 'wb') as f:
            f.write(data)
        return name
    def write_inline_css(self, ic, border_styles):
        font_size_classes = ['span.fs%d { font-size: %spt }'%(i, x) for i, x in
                enumerate(ic.font_sizes)]
        color_classes = ['span.col%d { color: %s }'%(i, x) for i, x in
                enumerate(ic.colors) if x != 'false']
        css = textwrap.dedent('''
        span.none {
            text-decoration: none; font-weight: normal;
            font-style: normal; font-variant: normal
        }
        span.italics { font-style: italic }
        span.bold { font-weight: bold }
        span.small-caps { font-variant: small-caps }
        span.underlined { text-decoration: underline }
        span.strike-through { text-decoration: line-through }
        ''')
        css += '\n'+'\n'.join(font_size_classes)
        css += '\n' +'\n'.join(color_classes)
        for cls, val in iteritems(border_styles):
            css += '\n\n.%s {\n%s\n}'%(cls, val)
        with open(u'styles.css', 'ab') as f:
            f.write(css.encode('utf-8'))
    def convert_borders(self, doc):
        border_styles = []
        style_map = {}
        for elem in doc.xpath(r'//*[local-name()="cell"]'):
            style = ['border-style: hidden', 'border-width: 1px',
                    'border-color: black']
            for x in ('bottom', 'top', 'left', 'right'):
                bs = elem.get('border-cell-%s-style'%x, None)
                if bs:
                    cbs = border_style_map.get(bs, 'solid')
                    style.append('border-%s-style: %s'%(x, cbs))
                bw = elem.get('border-cell-%s-line-width'%x, None)
                if bw:
                    style.append('border-%s-width: %spt'%(x, bw))
                bc = elem.get('border-cell-%s-color'%x, None)
                if bc:
                    style.append('border-%s-color: %s'%(x, bc))
            style = ';\n'.join(style)
            if style not in border_styles:
                border_styles.append(style)
            idx = border_styles.index(style)
            cls = 'border_style%d'%idx
            style_map[cls] = style
            elem.set('class', cls)
        return style_map
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from lxml import etree
        from calibre.ebooks.metadata.meta import get_metadata
        from calibre.ebooks.metadata.opf2 import OPFCreator
        from calibre.ebooks.rtf2xml.ParseRtf import RtfInvalidCodeException
        from calibre.ebooks.rtf.input import InlineClass
        from calibre.utils.xml_parse import safe_xml_fromstring
        self.opts = options
        self.log = log
        self.log('Converting RTF to XML...')
        try:
            xml = self.generate_xml(stream.name)
        except RtfInvalidCodeException as e:
            self.log.exception('Unable to parse RTF')
            raise ValueError(_('This RTF file has a feature calibre does not '
            'support. Convert it to HTML first and then try it.\n%s')%e)
        d = glob.glob(os.path.join('*_rtf_pict_dir', 'picts.rtf'))
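        # Initialize the image map unconditionally: it is read below even
        # when no picture directory was produced.
        imap = {}
        if d: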
            try:
                imap = self.extract_images(d[0])
            except:
                self.log.exception('Failed to extract images...')
        self.log('Parsing XML...')
        doc = safe_xml_fromstring(xml)
        border_styles = self.convert_borders(doc)
        for pict in doc.xpath('//rtf:pict[@num]',
                namespaces={'rtf':'http://rtf2xml.sourceforge.net/'}):
            num = int(pict.get('num'))
            name = imap.get(num, None)
            if name is not None:
                pict.set('num', name)
        self.log('Converting XML to HTML...')
        inline_class = InlineClass(self.log)
        styledoc = safe_xml_fromstring(P('templates/rtf.xsl', data=True), recover=False)
        extensions = {('calibre', 'inline-class') : inline_class}
        transform = etree.XSLT(styledoc, extensions=extensions)
        result = transform(doc)
        html = u'index.xhtml'
        with open(html, 'wb') as f:
            res = as_bytes(transform.tostring(result))
            # res = res[:100].replace('xmlns:html', 'xmlns') + res[100:]
            # clean multiple \n
            res = re.sub(b'\n+', b'\n', res)
            # Replace newlines inserted by the 'empty_paragraphs' option in rtf2xml with html blank lines
            # res = re.sub('\s*<body>', '<body>', res)
            # res = re.sub('(?<=\n)\n{2}',
            # u'<p>\u00a0</p>\n'.encode('utf-8'), res)
            f.write(res)
        self.write_inline_css(inline_class, border_styles)
        stream.seek(0)
        mi = get_metadata(stream, 'rtf')
        if not mi.title:
            mi.title = _('Unknown')
        if not mi.authors:
            mi.authors = [_('Unknown')]
        opf = OPFCreator(getcwd(), mi)
        opf.create_manifest([(u'index.xhtml', None)])
        opf.create_spine([u'index.xhtml'])
        # Use a context manager so the OPF file handle is closed promptly.
        with open(u'metadata.opf', 'wb') as f:
            opf.render(f)
        return os.path.abspath(u'metadata.opf')
    def postprocess_book(self, oeb, opts, log):
        for item in oeb.spine:
            for img in item.data.xpath('//*[local-name()="img" and @src="__REMOVE_ME__"]'):
                p = img.getparent()
                idx = p.index(img)
                p.remove(img)
                if img.tail:
                    if idx == 0:
                        p.text = (p.text or '') + img.tail
                    else:
                        p[idx-1].tail = (p[idx-1].tail or '') + img.tail
--- a/ebook_converter/ebooks/conversion/plugins/rtf_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/rtf_output.py
@@ -0,0 +1,40 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import OutputFormatPlugin
 class RTFOutput(OutputFormatPlugin):
    name = 'RTF Output'
    author = 'John Schember'
    file_type = 'rtf'
    commit_name = 'rtf_output'
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from calibre.ebooks.rtf.rtfml import RTFMLizer
        rtfmlitzer = RTFMLizer(log)
        content = rtfmlitzer.extract_content(oeb_book, opts)
        close = False
        if not hasattr(output_path, 'write'):
            close = True
            if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path) != '':
                os.makedirs(os.path.dirname(output_path))
            out_stream = lopen(output_path, 'wb')
        else:
            out_stream = output_path
        out_stream.seek(0)
        out_stream.truncate()
        out_stream.write(content.encode('ascii', 'replace'))
        if close:
            out_stream.close()
--- a/ebook_converter/ebooks/conversion/plugins/snb_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/snb_input.py
@@ -0,0 +1,122 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2010, Li Fanxi <lifanxi@freemindworld.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import InputFormatPlugin
 from calibre.ptempfile import TemporaryDirectory
 from calibre.utils.filenames import ascii_filename
 from polyglot.builtins import unicode_type
 HTML_TEMPLATE = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>%s</title></head><body>\n%s\n</body></html>'
 def html_encode(s):
    return s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&apos;').replace('\n', '<br/>').replace(' ', '&nbsp;')  # noqa
 class SNBInput(InputFormatPlugin):
    name        = 'SNB Input'
    author      = 'Li Fanxi'
    description = 'Convert SNB files to OEB'
    file_types  = {'snb'}
    commit_name = 'snb_input'
    options = set()
    def convert(self, stream, options, file_ext, log,
                accelerators):
        import uuid
        from calibre.ebooks.oeb.base import DirContainer
        from calibre.ebooks.snb.snbfile import SNBFile
        from calibre.utils.xml_parse import safe_xml_fromstring
        log.debug("Parsing SNB file...")
        snbFile = SNBFile()
        try:
            snbFile.Parse(stream)
        except:
            raise ValueError("Invalid SNB file")
        if not snbFile.IsValid():
            log.debug("Invalid SNB file")
            raise ValueError("Invalid SNB file")
        log.debug("Handle meta data ...")
        from calibre.ebooks.conversion.plumber import create_oebbook
        oeb = create_oebbook(log, None, options,
                encoding=options.input_encoding, populate=False)
        meta = snbFile.GetFileStream('snbf/book.snbf')
        if meta is not None:
            meta = safe_xml_fromstring(meta)
            l = {'title'    : './/head/name',
                  'creator'  : './/head/author',
                  'language' : './/head/language',
                  'generator': './/head/generator',
                  'publisher': './/head/publisher',
                  'cover'    : './/head/cover', }
            d = {}
            for item in l:
                node = meta.find(l[item])
                if node is not None:
                    d[item] = node.text if node.text is not None else ''
                else:
                    d[item] = ''
            oeb.metadata.add('title', d['title'])
            oeb.metadata.add('creator', d['creator'], attrib={'role':'aut'})
            oeb.metadata.add('language', d['language'].lower().replace('_', '-'))
            oeb.metadata.add('generator', d['generator'])
            oeb.metadata.add('publisher', d['publisher'])
            if d['cover'] != '':
                oeb.guide.add('cover', 'Cover', d['cover'])
        bookid = unicode_type(uuid.uuid4())
        oeb.metadata.add('identifier', bookid, id='uuid_id', scheme='uuid')
        for ident in oeb.metadata.identifier:
            if 'id' in ident.attrib:
                oeb.uid = oeb.metadata.identifier[0]
                break
        with TemporaryDirectory('_snb2oeb', keep=True) as tdir:
            log.debug('Process TOC ...')
            toc = snbFile.GetFileStream('snbf/toc.snbf')
            oeb.container = DirContainer(tdir, log)
            if toc is not None:
                toc = safe_xml_fromstring(toc)
                i = 1
                for ch in toc.find('.//body'):
                    chapterName = ch.text
                    chapterSrc = ch.get('src')
                    fname = 'ch_%d.htm' % i
                    data = snbFile.GetFileStream('snbc/' + chapterSrc)
                    if data is None:
                        continue
                    snbc = safe_xml_fromstring(data)
                    lines = []
                    for line in snbc.find('.//body'):
                        if line.tag == 'text':
                            lines.append('<p>%s</p>' % html_encode(line.text or ''))
                        elif line.tag == 'img':
                            lines.append('<p><img src="%s" /></p>' % html_encode(line.text))
                    with open(os.path.join(tdir, fname), 'wb') as f:
                        f.write((HTML_TEMPLATE % (chapterName, '\n'.join(lines))).encode('utf-8', 'replace'))
                    oeb.toc.add(ch.text, fname)
                    id, href = oeb.manifest.generate(id='html',
                        href=ascii_filename(fname))
                    item = oeb.manifest.add(id, href, 'text/html')
                    item.html_input_href = fname
                    oeb.spine.add(item, True)
                    i = i + 1
                imageFiles = snbFile.OutputImageFiles(tdir)
                for f, m in imageFiles:
                    id, href = oeb.manifest.generate(id='image',
                        href=ascii_filename(f))
                    item = oeb.manifest.add(id, href, m)
                    item.html_input_href = f
        return oeb
--- a/ebook_converter/ebooks/conversion/plugins/snb_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/snb_output.py
@@ -0,0 +1,269 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2010, Li Fanxi <lifanxi@freemindworld.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import OutputFormatPlugin, OptionRecommendation
 from calibre.ptempfile import TemporaryDirectory
 from calibre.constants import __appname__, __version__
 from polyglot.builtins import unicode_type
 class SNBOutput(OutputFormatPlugin):
    name = 'SNB Output'
    author = 'Li Fanxi'
    file_type = 'snb'
    commit_name = 'snb_output'
    options = {
        OptionRecommendation(name='snb_output_encoding', recommended_value='utf-8',
            level=OptionRecommendation.LOW,
            help=_('Specify the character encoding of the output document. '
            'The default is utf-8.')),
        OptionRecommendation(name='snb_max_line_length',
            recommended_value=0, level=OptionRecommendation.LOW,
            help=_('The maximum number of characters per line. This splits on '
            'the first space before the specified value. If no space is found '
            'the line will be broken at the space after and will exceed the '
            'specified value. Also, there is a minimum of 25 characters. '
            'Use 0 to disable line splitting.')),
        OptionRecommendation(name='snb_insert_empty_line',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Specify whether or not to insert an empty line between '
            'two paragraphs.')),
        OptionRecommendation(name='snb_dont_indent_first_line',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Specify whether or not to insert two space characters '
            'to indent the first line of each paragraph.')),
        OptionRecommendation(name='snb_hide_chapter_name',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Specify whether or not to hide the chapter title for each '
            'chapter. Useful for image-only output (eg. comics).')),
        OptionRecommendation(name='snb_full_screen',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Resize all the images for full screen view. ')),
     }
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from lxml import etree
        from calibre.ebooks.snb.snbfile import SNBFile
        from calibre.ebooks.snb.snbml import SNBMLizer, ProcessFileName
        self.opts = opts
        from calibre.ebooks.oeb.transforms.rasterize import SVGRasterizer, Unavailable
        try:
            rasterizer = SVGRasterizer()
            rasterizer(oeb_book, opts)
        except Unavailable:
            log.warn('SVG rasterizer unavailable, SVG will not be converted')
        # Create temp dir
        with TemporaryDirectory('_snb_output') as tdir:
            # Create stub directories
            snbfDir = os.path.join(tdir, 'snbf')
            snbcDir = os.path.join(tdir, 'snbc')
            snbiDir = os.path.join(tdir, 'snbc/images')
            os.mkdir(snbfDir)
            os.mkdir(snbcDir)
            os.mkdir(snbiDir)
            # Process Meta data
            meta = oeb_book.metadata
            if meta.title:
                title = unicode_type(meta.title[0])
            else:
                title = ''
            authors = [unicode_type(x) for x in meta.creator if x.role == 'aut']
            if meta.publisher:
                publishers = unicode_type(meta.publisher[0])
            else:
                publishers = ''
            if meta.language:
                lang = unicode_type(meta.language[0]).upper()
            else:
                lang = ''
            if meta.description:
                abstract = unicode_type(meta.description[0])
            else:
                abstract = ''
            # Process Cover
            g, m, s = oeb_book.guide, oeb_book.manifest, oeb_book.spine
            href = None
            if 'titlepage' not in g:
                if 'cover' in g:
                    href = g['cover'].href
            # Output book info file
            bookInfoTree = etree.Element("book-snbf", version="1.0")
            headTree = etree.SubElement(bookInfoTree, "head")
            etree.SubElement(headTree, "name").text = title
            etree.SubElement(headTree, "author").text = ' '.join(authors)
            etree.SubElement(headTree, "language").text = lang
            etree.SubElement(headTree, "rights")
            etree.SubElement(headTree, "publisher").text = publishers
            etree.SubElement(headTree, "generator").text = __appname__ + ' ' + __version__
            etree.SubElement(headTree, "created")
            etree.SubElement(headTree, "abstract").text = abstract
            if href is not None:
                etree.SubElement(headTree, "cover").text = ProcessFileName(href)
            else:
                etree.SubElement(headTree, "cover")
            with open(os.path.join(snbfDir, 'book.snbf'), 'wb') as f:
                f.write(etree.tostring(bookInfoTree, pretty_print=True, encoding='utf-8'))
            # Output TOC
            tocInfoTree = etree.Element("toc-snbf")
            tocHead = etree.SubElement(tocInfoTree, "head")
            tocBody = etree.SubElement(tocInfoTree, "body")
            outputFiles = {}
            if oeb_book.toc.count() == 0:
                log.warn('This SNB file has no Table of Contents. '
                    'Creating a default TOC')
                first = next(iter(oeb_book.spine))
                oeb_book.toc.add(_('Start page'), first.href)
            else:
                first = next(iter(oeb_book.spine))
                if oeb_book.toc[0].href != first.href:
                    # The pages before the first item in toc will be stored as
                    # "Cover Pages".
                    # oeb_book.toc does not support "insert", so we generate
                    # the tocInfoTree directly instead of modifying the toc
                    ch = etree.SubElement(tocBody, "chapter")
                    ch.set("src", ProcessFileName(first.href) + ".snbc")
                    ch.text = _('Cover pages')
                    outputFiles[first.href] = []
                    outputFiles[first.href].append(("", _("Cover pages")))
            for tocitem in oeb_book.toc:
                if tocitem.href.find('#') != -1:
                    item = tocitem.href.split('#')
                    if len(item) != 2:
                        log.error('Error in TOC item: %s' % tocitem)
                    else:
                        if item[0] in outputFiles:
                            outputFiles[item[0]].append((item[1], tocitem.title))
                        else:
                            outputFiles[item[0]] = []
                            if "" not in outputFiles[item[0]]:
                                outputFiles[item[0]].append(("", tocitem.title + _(" (Preface)")))
                                ch = etree.SubElement(tocBody, "chapter")
                                ch.set("src", ProcessFileName(item[0]) + ".snbc")
                                ch.text = tocitem.title + _(" (Preface)")
                            outputFiles[item[0]].append((item[1], tocitem.title))
                else:
                    if tocitem.href in outputFiles:
                        outputFiles[tocitem.href].append(("", tocitem.title))
                    else:
                        outputFiles[tocitem.href] = []
                        outputFiles[tocitem.href].append(("", tocitem.title))
                ch = etree.SubElement(tocBody, "chapter")
                ch.set("src", ProcessFileName(tocitem.href) + ".snbc")
                ch.text = tocitem.title
            etree.SubElement(tocHead, "chapters").text = '%d' % len(tocBody)
            with open(os.path.join(snbfDir, 'toc.snbf'), 'wb') as f:
                f.write(etree.tostring(tocInfoTree, pretty_print=True, encoding='utf-8'))
            # Output Files
            oldTree = None
            mergeLast = False
            lastName = None
            # Import before the loop so OEB_IMAGES is defined even if the
            # spine is empty.
            from calibre.ebooks.oeb.base import OEB_DOCS, OEB_IMAGES
            for item in s:
                if m.hrefs[item.href].media_type in OEB_DOCS:
                    if item.href not in outputFiles:
                        log.debug('File %s is unused in TOC. Continue in last chapter' % item.href)
                        mergeLast = True
                    else:
                        if oldTree is not None and mergeLast:
                            log.debug('Output the modified chapter again: %s' % lastName)
                            with open(os.path.join(snbcDir, lastName), 'wb') as f:
                                f.write(etree.tostring(oldTree, pretty_print=True, encoding='utf-8'))
                            mergeLast = False
                    log.debug('Converting %s to snbc...' % item.href)
                    snbwriter = SNBMLizer(log)
                    snbcTrees = None
                    if not mergeLast:
                        snbcTrees = snbwriter.extract_content(oeb_book, item, outputFiles[item.href], opts)
                        for subName in snbcTrees:
                            postfix = ''
                            if subName != '':
                                postfix = '_' + subName
                            lastName = ProcessFileName(item.href + postfix + ".snbc")
                            oldTree = snbcTrees[subName]
                            with open(os.path.join(snbcDir, lastName), 'wb') as f:
                                f.write(etree.tostring(oldTree, pretty_print=True, encoding='utf-8'))
                    else:
                        log.debug('Merge %s with last TOC item...' % item.href)
                        snbwriter.merge_content(oldTree, oeb_book, item, [('', _("Start"))], opts)
            # Output the last one if needed
            if oldTree is not None and mergeLast:
                log.debug('Output the last modified chapter again: %s' % lastName)
                with open(os.path.join(snbcDir, lastName), 'wb') as f:
                    f.write(etree.tostring(oldTree, pretty_print=True, encoding='utf-8'))
                mergeLast = False
            for item in m:
                if m.hrefs[item.href].media_type in OEB_IMAGES:
                    log.debug('Converting image: %s ...' % item.href)
                    content = m.hrefs[item.href].data
                    # Convert & Resize image
                    self.HandleImage(content, os.path.join(snbiDir, ProcessFileName(item.href)))
            # Package as SNB File
            snbFile = SNBFile()
            snbFile.FromDir(tdir)
            snbFile.Output(output_path)
    def HandleImage(self, imageData, imagePath):
        from calibre.utils.img import image_from_data, resize_image, image_to_data
        img = image_from_data(imageData)
        x, y = img.width(), img.height()
        if self.opts:
            if self.opts.snb_full_screen:
                SCREEN_X, SCREEN_Y = self.opts.output_profile.screen_size
            else:
                SCREEN_X, SCREEN_Y = self.opts.output_profile.comic_screen_size
        else:
            SCREEN_X = 540
            SCREEN_Y = 700
        # Handle big image only
        if x > SCREEN_X or y > SCREEN_Y:
            xScale = float(x) / SCREEN_X
            yScale = float(y) / SCREEN_Y
            scale = max(xScale, yScale)
            # TODO : intelligent image rotation
            #     img = img.rotate(90)
            #     x,y = y,x
            img = resize_image(img, int(x // scale), int(y // scale))
        with lopen(imagePath, 'wb') as f:
            f.write(image_to_data(img, fmt=imagePath.rpartition('.')[-1]))
 if __name__ == '__main__':
    from calibre.ebooks.oeb.reader import OEBReader
    from calibre.ebooks.oeb.base import OEBBook
    from calibre.ebooks.conversion.preprocess import HTMLPreProcessor
    from calibre.customize.profiles import HanlinV3Output
    class OptionValues(object):
        pass
    opts = OptionValues()
    opts.output_profile = HanlinV3Output(None)
    html_preprocessor = HTMLPreProcessor(None, None, opts)
    from calibre.utils.logging import default_log
    oeb = OEBBook(default_log, html_preprocessor)
    reader = OEBReader
    reader()(oeb, '/tmp/bbb/processed/')
    SNBOutput(None).convert(oeb, '/tmp/test.snb', None, None, default_log)
--- a/ebook_converter/ebooks/conversion/plugins/tcr_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/tcr_input.py
@@ -0,0 +1,39 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 from io import BytesIO
 from calibre.customize.conversion import InputFormatPlugin
 class TCRInput(InputFormatPlugin):
    name        = 'TCR Input'
    author      = 'John Schember'
    description = 'Convert TCR files to HTML'
    file_types  = {'tcr'}
    commit_name = 'tcr_input'
    def convert(self, stream, options, file_ext, log, accelerators):
        from calibre.ebooks.compression.tcr import decompress
        log.info('Decompressing text...')
        raw_txt = decompress(stream)
        log.info('Converting text to OEB...')
        stream = BytesIO(raw_txt)
        from calibre.customize.ui import plugin_for_input_format
        txt_plugin = plugin_for_input_format('txt')
        for opt in txt_plugin.options:
            if not hasattr(options, opt.option.name):
                setattr(options, opt.option.name, opt.recommended_value)
        stream.seek(0)
        return txt_plugin.convert(stream, options,
                'txt', log, accelerators)
--- a/ebook_converter/ebooks/conversion/plugins/tcr_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/tcr_output.py
@@ -0,0 +1,56 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre.customize.conversion import OutputFormatPlugin, \
    OptionRecommendation
 class TCROutput(OutputFormatPlugin):
    name = 'TCR Output'
    author = 'John Schember'
    file_type = 'tcr'
    commit_name = 'tcr_output'
    options = {
        OptionRecommendation(name='tcr_output_encoding', recommended_value='utf-8',
            level=OptionRecommendation.LOW,
            help=_('Specify the character encoding of the output document. '
            'The default is utf-8.'))}
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from calibre.ebooks.txt.txtml import TXTMLizer
        from calibre.ebooks.compression.tcr import compress
        close = False
        if not hasattr(output_path, 'write'):
            close = True
            if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path):
                os.makedirs(os.path.dirname(output_path))
            out_stream = lopen(output_path, 'wb')
        else:
            out_stream = output_path
        setattr(opts, 'flush_paras', False)
        setattr(opts, 'max_line_length', 0)
        setattr(opts, 'force_max_line_length', False)
        setattr(opts, 'indent_paras', False)
        writer = TXTMLizer(log)
        txt = writer.extract_content(oeb_book, opts).encode(opts.tcr_output_encoding, 'replace')
        log.info('Compressing text...')
        txt = compress(txt)
        out_stream.seek(0)
        out_stream.truncate()
        out_stream.write(txt)
        if close:
            out_stream.close()
--- a/ebook_converter/ebooks/conversion/plugins/txt_input.py
+++ b/ebook_converter/ebooks/conversion/plugins/txt_input.py
@@ -0,0 +1,308 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os
 from calibre import _ent_pat, walk, xml_entity_to_unicode
 from calibre.customize.conversion import InputFormatPlugin, OptionRecommendation
 from polyglot.builtins import getcwd
 MD_EXTENSIONS = {
    'abbr': _('Abbreviations'),
    'admonition': _('Support admonitions'),
    'attr_list': _('Add attribute to HTML tags'),
    'codehilite': _('Add code highlighting via Pygments'),
    'def_list': _('Definition lists'),
    'extra': _('Enables various common extensions'),
    'fenced_code': _('Alternative code block syntax'),
    'footnotes': _('Footnotes'),
    'legacy_attrs': _('Use legacy element attributes'),
    'legacy_em': _('Use legacy underscore handling for connected words'),
    'meta': _('Metadata in the document'),
    'nl2br': _('Treat newlines as hard breaks'),
    'sane_lists': _('Do not allow mixing list types'),
    'smarty': _('Use markdown\'s internal smartypants parser'),
    'tables': _('Support tables'),
    'toc': _('Generate a table of contents'),
    'wikilinks': _('Wiki style links'),
 }
 class TXTInput(InputFormatPlugin):
    name        = 'TXT Input'
    author      = 'John Schember'
    description = 'Convert TXT files to HTML'
    file_types  = {'txt', 'txtz', 'text', 'md', 'textile', 'markdown'}
    commit_name = 'txt_input'
    ui_data = {
        'md_extensions': MD_EXTENSIONS,
        'paragraph_types': {
            'auto': _('Try to auto detect paragraph type'),
            'block': _('Treat a blank line as a paragraph break'),
            'single': _('Assume every line is a paragraph'),
            'print': _('Assume every line starting with 2+ spaces or a tab starts a paragraph'),
            'unformatted': _('Most lines have hard line breaks, few/no blank lines or indents'),
            'off': _('Don\'t modify the paragraph structure'),
        },
        'formatting_types': {
            'auto': _('Automatically decide which formatting processor to use'),
            'plain': _('No formatting'),
            'heuristic': _('Use heuristics to determine chapter headings, italics, etc.'),
            'textile': _('Use the TexTile markup language'),
            'markdown': _('Use the Markdown markup language')
        },
    }
    options = {
        OptionRecommendation(name='formatting_type', recommended_value='auto',
            choices=list(ui_data['formatting_types']),
            help=_('Formatting used within the document.\n'
                   '* auto: {auto}\n'
                   '* plain: {plain}\n'
                   '* heuristic: {heuristic}\n'
                   '* textile: {textile}\n'
                   '* markdown: {markdown}\n'
                   'To learn more about markdown see {url}').format(
                       url='https://daringfireball.net/projects/markdown/', **ui_data['formatting_types'])
        ),
        OptionRecommendation(name='paragraph_type', recommended_value='auto',
            choices=list(ui_data['paragraph_types']),
            help=_('Paragraph structure to assume. The value of "off" is useful for formatted documents such as Markdown or Textile. '
                   'Choices are:\n'
                   '* auto: {auto}\n'
                   '* block: {block}\n'
                   '* single: {single}\n'
                   '* print:  {print}\n'
                   '* unformatted: {unformatted}\n'
                   '* off: {off}').format(**ui_data['paragraph_types'])
        ),
        OptionRecommendation(name='preserve_spaces', recommended_value=False,
            help=_('Normally extra spaces are condensed into a single space. '
                'With this option all spaces will be displayed.')),
        OptionRecommendation(name='txt_in_remove_indents', recommended_value=False,
            help=_('Normally extra space at the beginning of lines is retained. '
                   'With this option it will be removed.')),
        OptionRecommendation(name="markdown_extensions", recommended_value='footnotes, tables, toc',
            help=_('Enable extensions to markdown syntax. Extensions are formatting that is not part '
                   'of the standard markdown format. The extensions enabled by default: %default.\n'
                   'To learn more about markdown extensions, see {}\n'
                   'This should be a comma separated list of extensions to enable:\n'
                   ).format('https://python-markdown.github.io/extensions/') + '\n'.join('* %s: %s' % (k, MD_EXTENSIONS[k]) for k in sorted(MD_EXTENSIONS))),
    }
    def shift_file(self, fname, data):
        name, ext = os.path.splitext(fname)
        candidate = os.path.join(self.output_dir, fname)
        c = 0
        while os.path.exists(candidate):
            c += 1
            candidate = os.path.join(self.output_dir, '{}-{}{}'.format(name, c, ext))
        ans = candidate
        with open(ans, 'wb') as f:
            f.write(data)
        return f.name
    def fix_resources(self, html, base_dir):
        from html5_parser import parse
        root = parse(html)
        changed = False
        for img in root.xpath('//img[@src]'):
            src = img.get('src')
            prefix = src.split(':', 1)[0].lower()
            if prefix not in ('file', 'http', 'https', 'ftp') and not os.path.isabs(src):
                src = os.path.join(base_dir, src)
                if os.access(src, os.R_OK):
                    with open(src, 'rb') as f:
                        data = f.read()
                    f = self.shift_file(os.path.basename(src), data)
                    changed = True
                    img.set('src', os.path.basename(f))
        if changed:
            from lxml import etree
            html = etree.tostring(root, encoding='unicode')
        return html
    def convert(self, stream, options, file_ext, log,
                accelerators):
        from calibre.ebooks.conversion.preprocess import DocAnalysis, Dehyphenator
        from calibre.ebooks.chardet import detect
        from calibre.utils.zipfile import ZipFile
        from calibre.ebooks.txt.processor import (convert_basic,
                convert_markdown_with_metadata, separate_paragraphs_single_line,
                separate_paragraphs_print_formatted, preserve_spaces,
                detect_paragraph_type, detect_formatting_type,
                normalize_line_endings, convert_textile, remove_indents,
                block_to_single_line, separate_hard_scene_breaks)
        self.log = log
        txt = b''
        log.debug('Reading text from file...')
        length = 0
        base_dir = self.output_dir = getcwd()
        # Extract content from zip archive.
        if file_ext == 'txtz':
            zf = ZipFile(stream)
            zf.extractall('.')
            for x in walk('.'):
                if os.path.splitext(x)[1].lower() in ('.txt', '.text'):
                    with open(x, 'rb') as tf:
                        txt += tf.read() + b'\n\n'
        else:
            if getattr(stream, 'name', None):
                base_dir = os.path.dirname(stream.name)
            txt = stream.read()
            if file_ext in {'md', 'textile', 'markdown'}:
                options.formatting_type = {'md': 'markdown'}.get(file_ext, file_ext)
                log.info('File extension indicates particular formatting. '
                        'Forcing formatting type to: %s'%options.formatting_type)
                options.paragraph_type = 'off'
        # Get the encoding of the document.
        if options.input_encoding:
            ienc = options.input_encoding
            log.debug('Using user specified input encoding of %s' % ienc)
        else:
            det_encoding = detect(txt[:4096])
            det_encoding, confidence = det_encoding['encoding'], det_encoding['confidence']
            if det_encoding and det_encoding.lower().replace('_', '-').strip() in (
                    'gb2312', 'chinese', 'csiso58gb231280', 'euc-cn', 'euccn',
                    'eucgb2312-cn', 'gb2312-1980', 'gb2312-80', 'iso-ir-58'):
                # Microsoft Word exports to HTML with encoding incorrectly set to
                # gb2312 instead of gbk. gbk is a superset of gb2312, anyway.
                det_encoding = 'gbk'
            ienc = det_encoding
            log.debug('Detected input encoding as %s with a confidence of %s%%' % (ienc, confidence * 100))
        if not ienc:
            ienc = 'utf-8'
            log.debug('No input encoding specified and could not auto detect; using %s' % ienc)
        # Remove BOM from start of txt as its presence can confuse markdown
        import codecs
        for bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE, codecs.BOM_UTF8, codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
            if txt.startswith(bom):
                txt = txt[len(bom):]
                break
        txt = txt.decode(ienc, 'replace')
        # Replace entities
        txt = _ent_pat.sub(xml_entity_to_unicode, txt)
        # Normalize line endings
        txt = normalize_line_endings(txt)
        # Determine the paragraph type of the document.
        if options.paragraph_type == 'auto':
            options.paragraph_type = detect_paragraph_type(txt)
            if options.paragraph_type == 'unknown':
                log.debug('Could not reliably determine paragraph type; using block')
                options.paragraph_type = 'block'
            else:
                log.debug('Auto detected paragraph type as %s' % options.paragraph_type)
        # Detect formatting
        if options.formatting_type == 'auto':
            options.formatting_type = detect_formatting_type(txt)
            log.debug('Auto detected formatting as %s' % options.formatting_type)
        if options.formatting_type == 'heuristic':
            setattr(options, 'enable_heuristics', True)
            setattr(options, 'unwrap_lines', False)
            setattr(options, 'smarten_punctuation', True)
        # Reformat paragraphs to block formatting based on the detected type.
        # We don't check for block because the processor assumes block.
        # single and print are transformed to block for processing.
        if options.paragraph_type == 'single':
            txt = separate_paragraphs_single_line(txt)
        elif options.paragraph_type == 'print':
            txt = separate_hard_scene_breaks(txt)
            txt = separate_paragraphs_print_formatted(txt)
            txt = block_to_single_line(txt)
        elif options.paragraph_type == 'unformatted':
            from calibre.ebooks.conversion.utils import HeuristicProcessor
            # unwrap lines based on punctuation
            docanalysis = DocAnalysis('txt', txt)
            length = docanalysis.line_length(.5)
            preprocessor = HeuristicProcessor(options, log=getattr(self, 'log', None))
            txt = preprocessor.punctuation_unwrap(length, txt, 'txt')
            txt = separate_paragraphs_single_line(txt)
        elif options.paragraph_type == 'block':
            txt = separate_hard_scene_breaks(txt)
            txt = block_to_single_line(txt)
        if getattr(options, 'enable_heuristics', False) and getattr(options, 'dehyphenate', False):
            docanalysis = DocAnalysis('txt', txt)
            if not length:
                length = docanalysis.line_length(.5)
            dehyphenator = Dehyphenator(options.verbose, log=self.log)
            txt = dehyphenator(txt,'txt', length)
        # User requested transformation on the text.
        if options.txt_in_remove_indents:
            txt = remove_indents(txt)
        # Preserve spaces will replace multiple spaces with a space
        # followed by the &nbsp; entity.
        if options.preserve_spaces:
            txt = preserve_spaces(txt)
        # Process the text using the appropriate text processor.
        self.shifted_files = []
        try:
            html = ''
            input_mi = None
            if options.formatting_type == 'markdown':
                log.debug('Running text through markdown conversion...')
                try:
                    input_mi, html = convert_markdown_with_metadata(txt, extensions=[x.strip() for x in options.markdown_extensions.split(',') if x.strip()])
                except RuntimeError:
                    raise ValueError('This txt file has malformed markup; it cannot be'
                        ' converted by calibre. See https://daringfireball.net/projects/markdown/syntax')
                html = self.fix_resources(html, base_dir)
            elif options.formatting_type == 'textile':
                log.debug('Running text through textile conversion...')
                html = convert_textile(txt)
                html = self.fix_resources(html, base_dir)
            else:
                log.debug('Running text through basic conversion...')
                flow_size = getattr(options, 'flow_size', 0)
                html = convert_basic(txt, epub_split_size_kb=flow_size)
            # Run the HTMLized text through the html processing plugin.
            from calibre.customize.ui import plugin_for_input_format
            html_input = plugin_for_input_format('html')
            for opt in html_input.options:
                setattr(options, opt.option.name, opt.recommended_value)
            options.input_encoding = 'utf-8'
            htmlfile = self.shift_file('index.html', html.encode('utf-8'))
            odi = options.debug_pipeline
            options.debug_pipeline = None
            # Generate oeb from html conversion.
            with open(htmlfile, 'rb') as f:
                oeb = html_input.convert(f, options, 'html', log, {})
            options.debug_pipeline = odi
        finally:
            for x in self.shifted_files:
                os.remove(x)
        # Set metadata from file.
        if input_mi is None:
            from calibre.customize.ui import get_file_type_metadata
            input_mi = get_file_type_metadata(stream, file_ext)
        from calibre.ebooks.oeb.transforms.metadata import meta_info_to_oeb_metadata
        meta_info_to_oeb_metadata(input_mi, oeb.metadata, log)
        self.html_postprocess_title = input_mi.title
        return oeb
    def postprocess_book(self, oeb, opts, log):
        for item in oeb.spine:
            if hasattr(item.data, 'xpath'):
                for title in item.data.xpath('//*[local-name()="title"]'):
                    if title.text == _('Unknown'):
                        title.text = self.html_postprocess_title
--- a/ebook_converter/ebooks/conversion/plugins/txt_output.py
+++ b/ebook_converter/ebooks/conversion/plugins/txt_output.py
@@ -0,0 +1,165 @@
 # -*- coding: utf-8 -*-
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL 3'
 __copyright__ = '2009, John Schember <john@nachtimwald.com>'
 __docformat__ = 'restructuredtext en'
 import os
 import shutil
 from calibre.customize.conversion import OutputFormatPlugin, \
    OptionRecommendation
 from calibre.ptempfile import TemporaryDirectory, TemporaryFile
 NEWLINE_TYPES = ['system', 'unix', 'old_mac', 'windows']
 class TXTOutput(OutputFormatPlugin):
    name = 'TXT Output'
    author = 'John Schember'
    file_type = 'txt'
    commit_name = 'txt_output'
    ui_data = {
            'newline_types': NEWLINE_TYPES,
            'formatting_types': {
                'plain': _('Plain text'),
                'markdown': _('Markdown formatted text'),
                'textile': _('TexTile formatted text')
            },
    }
    options = {
        OptionRecommendation(name='newline', recommended_value='system',
            level=OptionRecommendation.LOW,
            short_switch='n', choices=NEWLINE_TYPES,
            help=_('Type of newline to use. Options are %s. Default is \'system\'. '
                'Use \'old_mac\' for compatibility with Mac OS 9 and earlier. '
                'For macOS use \'unix\'. \'system\' will default to the newline '
                'type used by this OS.') % sorted(NEWLINE_TYPES)),
        OptionRecommendation(name='txt_output_encoding', recommended_value='utf-8',
            level=OptionRecommendation.LOW,
            help=_('Specify the character encoding of the output document. '
            'The default is utf-8.')),
        OptionRecommendation(name='inline_toc',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Add Table of Contents to beginning of the book.')),
        OptionRecommendation(name='max_line_length',
            recommended_value=0, level=OptionRecommendation.LOW,
            help=_('The maximum number of characters per line. This splits on '
            'the first space before the specified value. If no space is found '
            'the line will be broken at the space after and will exceed the '
            'specified value. Also, there is a minimum of 25 characters. '
            'Use 0 to disable line splitting.')),
        OptionRecommendation(name='force_max_line_length',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Force splitting on the max-line-length value when no space '
            'is present. Also allows max-line-length to be below the minimum.')),
        OptionRecommendation(name='txt_output_formatting',
             recommended_value='plain',
             choices=list(ui_data['formatting_types']),
             help=_('Formatting used within the document.\n'
                    '* plain: {plain}\n'
                    '* markdown: {markdown}\n'
                    '* textile: {textile}').format(**ui_data['formatting_types'])),
        OptionRecommendation(name='keep_links',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Do not remove links within the document. This is only '
            'useful when paired with a txt-output-formatting option that '
            'is not plain because links are always removed with plain text output.')),
        OptionRecommendation(name='keep_image_references',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Do not remove image references within the document. This is only '
            'useful when paired with a txt-output-formatting option that '
            'is not plain because image references are always removed with plain text output.')),
        OptionRecommendation(name='keep_color',
            recommended_value=False, level=OptionRecommendation.LOW,
            help=_('Do not remove font color from output. This is only useful when '
                   'txt-output-formatting is set to textile. Textile is the only '
                   'formatting that supports setting font color. If this option is '
                   'not specified font color will not be set and default to the '
                   'color displayed by the reader (generally this is black).')),
     }
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from calibre.ebooks.txt.txtml import TXTMLizer
        from calibre.utils.cleantext import clean_ascii_chars
        from calibre.ebooks.txt.newlines import specified_newlines, TxtNewlines
        if opts.txt_output_formatting.lower() == 'markdown':
            from calibre.ebooks.txt.markdownml import MarkdownMLizer
            self.writer = MarkdownMLizer(log)
        elif opts.txt_output_formatting.lower() == 'textile':
            from calibre.ebooks.txt.textileml import TextileMLizer
            self.writer = TextileMLizer(log)
        else:
            self.writer = TXTMLizer(log)
        txt = self.writer.extract_content(oeb_book, opts)
        txt = clean_ascii_chars(txt)
        log.debug('\tReplacing newlines with selected type...')
        txt = specified_newlines(TxtNewlines(opts.newline).newline, txt)
        close = False
        if not hasattr(output_path, 'write'):
            close = True
            if not os.path.exists(os.path.dirname(output_path)) and os.path.dirname(output_path) != '':
                os.makedirs(os.path.dirname(output_path))
            out_stream = open(output_path, 'wb')
        else:
            out_stream = output_path
        out_stream.seek(0)
        out_stream.truncate()
        out_stream.write(txt.encode(opts.txt_output_encoding, 'replace'))
        if close:
            out_stream.close()
 class TXTZOutput(TXTOutput):
    name = 'TXTZ Output'
    author = 'John Schember'
    file_type = 'txtz'
    def convert(self, oeb_book, output_path, input_plugin, opts, log):
        from calibre.ebooks.oeb.base import OEB_IMAGES
        from calibre.utils.zipfile import ZipFile
        from lxml import etree
        with TemporaryDirectory('_txtz_output') as tdir:
            # TXT
            txt_name = 'index.txt'
            if opts.txt_output_formatting.lower() == 'textile':
                txt_name = 'index.text'
            with TemporaryFile(txt_name) as tf:
                TXTOutput.convert(self, oeb_book, tf, input_plugin, opts, log)
                shutil.copy(tf, os.path.join(tdir, txt_name))
            # Images
            for item in oeb_book.manifest:
                if item.media_type in OEB_IMAGES:
                    if hasattr(self.writer, 'images'):
                        path = os.path.join(tdir, 'images')
                        if item.href in self.writer.images:
                            href = self.writer.images[item.href]
                        else:
                            continue
                    else:
                        path = os.path.join(tdir, os.path.dirname(item.href))
                        href = os.path.basename(item.href)
                    if not os.path.exists(path):
                        os.makedirs(path)
                    with open(os.path.join(path, href), 'wb') as imgf:
                        imgf.write(item.data)
            # Metadata
            with open(os.path.join(tdir, 'metadata.opf'), 'wb') as mdataf:
                mdataf.write(etree.tostring(oeb_book.metadata.to_opf1()))
            txtz = ZipFile(output_path, 'w')
            txtz.add_dir(tdir)
            # Close explicitly so the archive is finalized before the temp dir is removed.
            txtz.close()
--- a/ebook_converter/ebooks/conversion/plumber.py
+++ b/ebook_converter/ebooks/conversion/plumber.py
--- a/ebook_converter/ebooks/conversion/preprocess.py
+++ b/ebook_converter/ebooks/conversion/preprocess.py
@@ -0,0 +1,646 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import functools, re, json
 from math import ceil
 from calibre import entity_to_unicode, as_unicode
 from polyglot.builtins import unicode_type, range
 XMLDECL_RE    = re.compile(r'^\s*<[?]xml.*?[?]>')
 SVG_NS       = 'http://www.w3.org/2000/svg'
 XLINK_NS     = 'http://www.w3.org/1999/xlink'
 convert_entities = functools.partial(entity_to_unicode,
        result_exceptions={
            '<' : '&lt;',
            '>' : '&gt;',
            "'" : '&apos;',
            '"' : '&quot;',
            '&' : '&amp;',
        })
 _span_pat = re.compile('<span.*?</span>', re.DOTALL|re.IGNORECASE)
 LIGATURES = {
 #        '\u00c6': 'AE',
 #        '\u00e6': 'ae',
 #        '\u0152': 'OE',
 #        '\u0153': 'oe',
 #        '\u0132': 'IJ',
 #        '\u0133': 'ij',
 #        '\u1D6B': 'ue',
        '\uFB00': 'ff',
        '\uFB01': 'fi',
        '\uFB02': 'fl',
        '\uFB03': 'ffi',
        '\uFB04': 'ffl',
        '\uFB05': 'ft',
        '\uFB06': 'st',
        }
 _ligpat = re.compile('|'.join(LIGATURES))
 def sanitize_head(match):
    x = match.group(1)
    x = _span_pat.sub('', x)
    return '<head>\n%s\n</head>' % x
 def chap_head(match):
    chap = match.group('chap')
    title = match.group('title')
    if not title:
        return '<h1>'+chap+'</h1><br/>\n'
    else:
        return '<h1>'+chap+'</h1>\n<h3>'+title+'</h3>\n'
 def wrap_lines(match):
    ital = match.group('ital')
    if not ital:
        return ' '
    else:
        return ital+' '
 def smarten_punctuation(html, log=None):
    from calibre.utils.smartypants import smartyPants
    from calibre.ebooks.chardet import substitute_entites
    from calibre.ebooks.conversion.utils import HeuristicProcessor
    preprocessor = HeuristicProcessor(log=log)
    from uuid import uuid4
    start = 'calibre-smartypants-'+unicode_type(uuid4())
    stop = 'calibre-smartypants-'+unicode_type(uuid4())
    html = html.replace('<!--', start)
    html = html.replace('-->', stop)
    html = preprocessor.fix_nbsp_indents(html)
    html = smartyPants(html)
    html = html.replace(start, '<!--')
    html = html.replace(stop, '-->')
    return substitute_entites(html)
 class DocAnalysis(object):
    '''
    Provides various text analysis functions to determine how the document is structured.
    format is the type of document the analysis will be run against.
    raw is the raw text to determine the line length to use for wrapping.
    Blank lines are excluded from analysis.
    '''
    def __init__(self, format='html', raw=''):
        raw = raw.replace('&nbsp;', ' ')
        if format == 'html':
            linere = re.compile(r'(?<=<p)(?![^>]*>\s*</p>).*?(?=</p>)', re.DOTALL)
        elif format == 'pdf':
            linere = re.compile(r'(?<=<br>)(?!\s*<br>).*?(?=<br>)', re.DOTALL)
        elif format == 'spanned_html':
            linere = re.compile('(?<=<span).*?(?=</span>)', re.DOTALL)
        elif format == 'txt':
            linere = re.compile('.*?\n')
        self.lines = linere.findall(raw)
    def line_length(self, percent):
        '''
        Analyses the document to find the median line length.
        percent is a decimal number between 0 and 1 that determines
        how far into the list of line lengths to look. The list of line lengths is
        ordered smallest to largest and does not include duplicates. 0.5 is the
        median value.
        '''
        lengths = []
        for line in self.lines:
            if len(line) > 0:
                lengths.append(len(line))
        if not lengths:
            return 0
        lengths = list(set(lengths))
        total = sum(lengths)
        avg = total / len(lengths)
        max_line = ceil(avg * 2)
        lengths = sorted(lengths)
        for i in range(len(lengths) - 1, -1, -1):
            if lengths[i] > max_line:
                del lengths[i]
        if percent > 1:
            percent = 1
        if percent < 0:
            percent = 0
        # Clamp at 0 so that very small percent values do not wrap around to the longest line
        index = max(int(len(lengths) * percent) - 1, 0)
        return lengths[index]
    def line_histogram(self, percent):
        '''
        Creates a broad histogram of the document to determine whether it incorporates hard
        line breaks.  Lines are sorted into 20 'buckets' based on length.
        percent is the fraction of lines (0 - 1) that must fall in a single bucket to return True.
        The majority of the lines will exist in 1-2 buckets in typical docs with hard line breaks.
        '''
        minLineLength=20  # Ignore lines under 20 chars (typical of spaces)
        maxLineLength=1900  # Discard larger than this to stay in range
        buckets=20  # Each line is divided into a bucket based on length
        # print("there are "+unicode_type(len(lines))+" lines")
        # max = 0
        # for line in self.lines:
        #    l = len(line)
        #    if l > max:
        #        max = l
        # print("max line found is "+unicode_type(max))
        # Build the line length histogram
        hRaw = [0 for i in range(0,buckets)]
        for line in self.lines:
            l = len(line)
            if l > minLineLength and l < maxLineLength:
                l = int(l // 100)
                # print("adding "+unicode_type(l))
                hRaw[l]+=1
        # Normalize the histogram into percents
        totalLines = len(self.lines)
        if totalLines > 0:
            h = [float(count)/totalLines for count in hRaw]
        else:
            h = []
        # print("\nhRaw histogram lengths are: "+unicode_type(hRaw))
        # print("              percents are: "+unicode_type(h)+"\n")
        # Find the biggest bucket
        maxValue = 0
        for i in range(0,len(h)):
            if h[i] > maxValue:
                maxValue = h[i]
        if maxValue < percent:
            # print("Line lengths are too variable. Not unwrapping.")
            return False
        else:
            # print(unicode_type(maxValue)+" of the lines were in one bucket")
            return True
 class Dehyphenator(object):
    '''
    Analyzes words to determine whether hyphens should be retained/removed.  Uses the document
    itself as a dictionary. This method handles all languages along with uncommon, made-up, and
    scientific words. The primary disadvantage is that words appearing only once in the document
    retain hyphens.
    '''
    def __init__(self, verbose=0, log=None):
        self.log = log
        self.verbose = verbose
        # Add common suffixes to the regex below to increase the likelihood of a match -
        # don't add suffixes which are also complete words, such as 'able' or 'sex'
        # only remove if it's not already the point of hyphenation
        self.suffix_string = (
            "((ed)?ly|'?e?s||a?(t|s)?ion(s|al(ly)?)?|ings?|er|(i)?ous|"
            "(i|a)ty|(it)?ies|ive|gence|istic(ally)?|(e|a)nce|m?ents?|ism|ated|"
            "(e|u)ct(ed)?|ed|(i|ed)?ness|(e|a)ncy|ble|ier|al|ex|ian)$")
        self.suffixes = re.compile(r"^%s" % self.suffix_string, re.IGNORECASE)
        self.removesuffixes = re.compile(r"%s" % self.suffix_string, re.IGNORECASE)
        # remove prefixes if the prefix was not already the point of hyphenation
        self.prefix_string = '^(dis|re|un|in|ex)'
        self.prefixes = re.compile(r'%s$' % self.prefix_string, re.IGNORECASE)
        self.removeprefix = re.compile(r'%s' % self.prefix_string, re.IGNORECASE)
    def dehyphenate(self, match):
        firsthalf = match.group('firstpart')
        secondhalf = match.group('secondpart')
        try:
            wraptags = match.group('wraptags')
        except IndexError:
            wraptags = ''
        hyphenated = unicode_type(firsthalf) + "-" + unicode_type(secondhalf)
        dehyphenated = unicode_type(firsthalf) + unicode_type(secondhalf)
        if self.suffixes.match(secondhalf) is None:
            lookupword = self.removesuffixes.sub('', dehyphenated)
        else:
            lookupword = dehyphenated
        if len(firsthalf) > 4 and self.prefixes.match(firsthalf) is None:
            lookupword = self.removeprefix.sub('', lookupword)
        if self.verbose > 2:
            self.log("lookup word is: "+lookupword+", orig is: " + hyphenated)
        try:
            searchresult = self.html.find(lookupword.lower())
        except:
            return hyphenated
        if self.format == 'html_cleanup' or self.format == 'txt_cleanup':
            if self.html.find(lookupword) != -1 or searchresult != -1:
                if self.verbose > 2:
                    self.log("    Cleanup:returned dehyphenated word: " + dehyphenated)
                return dehyphenated
            elif self.html.find(hyphenated) != -1:
                if self.verbose > 2:
                    self.log("        Cleanup:returned hyphenated word: " + hyphenated)
                return hyphenated
            else:
                if self.verbose > 2:
                    self.log("            Cleanup:returning original text "+firsthalf+" + linefeed "+secondhalf)
                return firsthalf+'\u2014'+wraptags+secondhalf
        else:
            if self.format == 'individual_words' and len(firsthalf) + len(secondhalf) <= 6:
                if self.verbose > 2:
                    self.log("too short, returned hyphenated word: " + hyphenated)
                return hyphenated
            if len(firsthalf) <= 2 and len(secondhalf) <= 2:
                if self.verbose > 2:
                    self.log("too short, returned hyphenated word: " + hyphenated)
                return hyphenated
            if self.html.find(lookupword) != -1 or searchresult != -1:
                if self.verbose > 2:
                    self.log("     returned dehyphenated word: " + dehyphenated)
                return dehyphenated
            else:
                if self.verbose > 2:
                    self.log("          returned hyphenated word: " + hyphenated)
                return hyphenated
    def __call__(self, html, format, length=1):
        self.html = html
        self.format = format
        if format == 'html':
            intextmatch = re.compile((
                r'(?<=.{%i})(?P<firstpart>[^\W\-]+)(-|‐)\s*(?=<)(?P<wraptags>(</span>)?'
                r'\s*(</[iubp]>\s*){1,2}(?P<up2threeblanks><(p|div)[^>]*>\s*(<p[^>]*>\s*</p>\s*)'
                r'?</(p|div)>\s+){0,3}\s*(<[iubp][^>]*>\s*){1,2}(<span[^>]*>)?)\s*(?P<secondpart>[\w\d]+)') % length)
        elif format == 'pdf':
            intextmatch = re.compile((
                r'(?<=.{%i})(?P<firstpart>[^\W\-]+)(-|‐)\s*(?P<wraptags><p>|'
                r'</[iub]>\s*<p>\s*<[iub]>)\s*(?P<secondpart>[\w\d]+)')% length)
        elif format == 'txt':
            intextmatch = re.compile(
                '(?<=.{%i})(?P<firstpart>[^\\W\\-]+)(-|‐)(\u0020|\u0009)*(?P<wraptags>(\n(\u0020|\u0009)*)+)(?P<secondpart>[\\w\\d]+)'% length)
        elif format == 'individual_words':
            intextmatch = re.compile(
                r'(?!<)(?P<firstpart>[^\W\-]+)(-|‐)\s*(?P<secondpart>\w+)(?![^<]*?>)', re.UNICODE)
        elif format == 'html_cleanup':
            intextmatch = re.compile(
                r'(?P<firstpart>[^\W\-]+)(-|‐)\s*(?=<)(?P<wraptags></span>\s*(</[iubp]>'
                r'\s*<[iubp][^>]*>\s*)?<span[^>]*>|</[iubp]>\s*<[iubp][^>]*>)?\s*(?P<secondpart>[\w\d]+)')
        elif format == 'txt_cleanup':
            intextmatch = re.compile(
                r'(?P<firstpart>[^\W\-]+)(-|‐)(?P<wraptags>\s+)(?P<secondpart>[\w\d]+)')
        html = intextmatch.sub(self.dehyphenate, html)
        return html
 class CSSPreProcessor(object):
    # Remove some of the broken CSS Microsoft products
    # create
    MS_PAT     = re.compile(r'''
        (?P<start>^|;|\{)\s*    # The end of the previous rule or block start
        (%s).+?                 # The invalid properties
        (?P<end>$|;|\})         # The end of the declaration
        '''%'mso-|panose-|text-underline|tab-interval',
        re.MULTILINE|re.IGNORECASE|re.VERBOSE)
    def ms_sub(self, match):
        end = match.group('end')
        try:
            start = match.group('start')
        except:
            start = ''
        if end == ';':
            end = ''
        return start + end
    def __call__(self, data, add_namespace=False):
        from calibre.ebooks.oeb.base import XHTML_CSS_NAMESPACE
        data = self.MS_PAT.sub(self.ms_sub, data)
        if not add_namespace:
            return data
        # Remove comments as the following namespace logic will break if there
        # are commented lines before the first @import or @charset rule. Since
        # the conversion will remove all stylesheets anyway, we don't lose
        # anything
        data = re.sub(unicode_type(r'/\*.*?\*/'), '', data, flags=re.DOTALL)
        ans, namespaced = [], False
        for line in data.splitlines():
            ll = line.lstrip()
            if not (namespaced or ll.startswith('@import') or not ll or
                        ll.startswith('@charset')):
                ans.append(XHTML_CSS_NAMESPACE.strip())
                namespaced = True
            ans.append(line)
        return '\n'.join(ans)
 def accent_regex(accent_maps, letter_before=False):
    accent_cat = set()
    letters = set()
    for accent in tuple(accent_maps):
        accent_cat.add(accent)
        k, v = accent_maps[accent].split(':', 1)
        if len(k) != len(v):
            raise ValueError('Invalid mapping for: {} -> {}'.format(k, v))
        accent_maps[accent] = lmap = dict(zip(k, v))
        letters |= set(lmap)
    if letter_before:
        args = ''.join(letters), ''.join(accent_cat)
        accent_group, letter_group = 2, 1
    else:
        args = ''.join(accent_cat), ''.join(letters)
        accent_group, letter_group = 1, 2
    pat = re.compile(r'([{}])\s*(?:<br[^>]*>){{0,1}}\s*([{}])'.format(*args), re.UNICODE)
    def sub(m):
        lmap = accent_maps[m.group(accent_group)]
        return lmap.get(m.group(letter_group)) or m.group()
    return pat, sub
 def html_preprocess_rules():
    ans = getattr(html_preprocess_rules, 'ans', None)
    if ans is None:
        ans = html_preprocess_rules.ans = [
        # Remove huge block of contiguous spaces as they slow down
        # the following regexes pretty badly
        (re.compile(r'\s{10000,}'), ''),
        # Some idiotic HTML generators (Frontpage I'm looking at you)
        # Put all sorts of crap into <head>. This messes up lxml
        (re.compile(r'<head[^>]*>\n*(.*?)\n*</head>', re.IGNORECASE|re.DOTALL),
        sanitize_head),
        # Convert all entities, since lxml doesn't handle them well
        (re.compile(r'&(\S+?);'), convert_entities),
        # Remove the <![if/endif tags inserted by everybody's darling, MS Word
        (re.compile(r'</{0,1}!\[(end){0,1}if\]{0,1}>', re.IGNORECASE), ''),
    ]
    return ans
 def pdftohtml_rules():
    ans = getattr(pdftohtml_rules, 'ans', None)
    if ans is None:
        ans = pdftohtml_rules.ans = [
        accent_regex({
            '¨': 'aAeEiIoOuU:äÄëËïÏöÖüÜ',
            '`': 'aAeEiIoOuU:àÀèÈìÌòÒùÙ',
            '´': 'aAcCeEiIlLoOnNrRsSuUzZ:áÁćĆéÉíÍĺĹóÓńŃŕŔśŚúÚźŹ',
            'ˆ': 'aAeEiIoOuU:âÂêÊîÎôÔûÛ',
            '¸': 'cC:çÇ',
            '˛': 'aAeE:ąĄęĘ',
            '˙': 'zZ:żŻ',
            'ˇ': 'cCdDeElLnNrRsStTzZ:čČďĎěĚľĽňŇřŘšŠťŤžŽ',
            '°': 'uU:ůŮ',
        }),
        accent_regex({'`': 'aAeEiIoOuU:àÀèÈìÌòÒùÙ'}, letter_before=True),
        # If pdf printed from a browser then the header/footer has a reliable pattern
        (re.compile(r'((?<=</a>)\s*file:/{2,4}[A-Z].*<br>|file:////?[A-Z].*<br>(?=\s*<hr>))', re.IGNORECASE), lambda match: ''),
        # Center separator lines
        (re.compile(r'<br>\s*(?P<break>([*#•✦=] *){3,})\s*<br>'), lambda match: '<p>\n<p style="text-align:center">' + match.group('break') + '</p>'),
        # Remove <hr> tags
        (re.compile(r'<hr.*?>', re.IGNORECASE), ''),
        # Remove gray background
        (re.compile(r'<BODY[^<>]+>'), '<BODY>'),
        # Convert line breaks to paragraphs
        (re.compile(r'<br[^>]*>\s*'), '</p>\n<p>'),
        (re.compile(r'<body[^>]*>\s*'), '<body>\n<p>'),
        (re.compile(r'\s*</body>'), '</p>\n</body>'),
        # Clean up spaces
        (re.compile(r'(?<=[\.,;\?!”"\'])[\s^ ]*(?=<)'), ' '),
        # Add space before and after italics
        (re.compile(r'(?<!“)<i>'), ' <i>'),
        (re.compile(r'</i>(?=\w)'), '</i> '),
    ]
    return ans
 def book_designer_rules():
    ans = getattr(book_designer_rules, 'ans', None)
    if ans is None:
        ans = book_designer_rules.ans = [
        # HR
        (re.compile('<hr>', re.IGNORECASE),
        lambda match : '<span style="page-break-after:always"> </span>'),
        # Create header tags
        (re.compile(r'<h2[^><]*?id=BookTitle[^><]*?(align=)*(?(1)(\w+))*[^><]*?>[^><]*?</h2>', re.IGNORECASE),
        lambda match : '<h1 id="BookTitle" align="%s">%s</h1>'%(match.group(2) if match.group(2) else 'center', match.group(3))),
        (re.compile(r'<h2[^><]*?id=BookAuthor[^><]*?(align=)*(?(1)(\w+))*[^><]*?>[^><]*?</h2>', re.IGNORECASE),
        lambda match : '<h2 id="BookAuthor" align="%s">%s</h2>'%(match.group(2) if match.group(2) else 'center', match.group(3))),
        (re.compile('<span[^><]*?id=title[^><]*?>(.*?)</span>', re.IGNORECASE|re.DOTALL),
        lambda match : '<h2 class="title">%s</h2>'%(match.group(1),)),
        (re.compile('<span[^><]*?id=subtitle[^><]*?>(.*?)</span>', re.IGNORECASE|re.DOTALL),
        lambda match : '<h3 class="subtitle">%s</h3>'%(match.group(1),)),
    ]
    return ans
 class HTMLPreProcessor(object):
    def __init__(self, log=None, extra_opts=None, regex_wizard_callback=None):
        self.log = log
        self.extra_opts = extra_opts
        self.regex_wizard_callback = regex_wizard_callback
        self.current_href = None
    def is_baen(self, src):
        return re.compile(r'<meta\s+name="Publisher"\s+content=".*?Baen.*?"',
                          re.IGNORECASE).search(src) is not None
    def is_book_designer(self, raw):
        return re.search('<H2[^><]*id=BookTitle', raw) is not None
    def is_pdftohtml(self, src):
        return '<!-- created by calibre\'s pdftohtml -->' in src[:1000]
    def __call__(self, html, remove_special_chars=None,
            get_preprocess_html=False):
        if remove_special_chars is not None:
            html = remove_special_chars.sub('', html)
        html = html.replace('\0', '')
        is_pdftohtml = self.is_pdftohtml(html)
        if self.is_baen(html):
            rules = []
        elif self.is_book_designer(html):
            rules = book_designer_rules()
        elif is_pdftohtml:
            rules = pdftohtml_rules()
        else:
            rules = []
        start_rules = []
        if not getattr(self.extra_opts, 'keep_ligatures', False):
            html = _ligpat.sub(lambda m:LIGATURES[m.group()], html)
        user_sr_rules = {}
        # Function for processing search and replace
        def do_search_replace(search_pattern, replace_txt):
            from calibre.ebooks.conversion.search_replace import compile_regular_expression
            try:
                search_re = compile_regular_expression(search_pattern)
                if not replace_txt:
                    replace_txt = ''
                rules.insert(0, (search_re, replace_txt))
                user_sr_rules[(search_re, replace_txt)] = search_pattern
            except Exception as e:
                self.log.error('Failed to parse %r regexp because %s' %
                        (search_pattern, as_unicode(e)))
        # search / replace using the sr?_search / sr?_replace options
        for i in range(1, 4):
            search, replace = 'sr%d_search'%i, 'sr%d_replace'%i
            search_pattern = getattr(self.extra_opts, search, '')
            replace_txt = getattr(self.extra_opts, replace, '')
            if search_pattern:
                do_search_replace(search_pattern, replace_txt)
        # multi-search / replace using the search_replace option
        search_replace = getattr(self.extra_opts, 'search_replace', None)
        if search_replace:
            search_replace = json.loads(search_replace)
            for search_pattern, replace_txt in reversed(search_replace):
                do_search_replace(search_pattern, replace_txt)
        end_rules = []
        # delete soft hyphens - moved here so it's executed after header/footer removal
        if is_pdftohtml:
            # unwrap/delete soft hyphens
            end_rules.append((re.compile(
                r'[\u00ad](</p>\s*<p>\s*)+\s*(?=[\[a-z\d])'), lambda match: ''))
            # unwrap/delete soft hyphens with formatting
            end_rules.append((re.compile(
                r'[\u00ad]\s*(</(i|u|b)>)+(</p>\s*<p>\s*)+\s*(<(i|u|b)>)+\s*(?=[\[a-z\d])'), lambda match: ''))
        length = -1
        if getattr(self.extra_opts, 'unwrap_factor', 0.0) > 0.01:
            docanalysis = DocAnalysis('pdf', html)
            length = docanalysis.line_length(getattr(self.extra_opts, 'unwrap_factor'))
            if length:
                # print("The pdf line length returned is " + unicode_type(length))
                # unwrap em/en dashes
                end_rules.append((re.compile(
                    r'(?<=.{%i}[–—])\s*<p>\s*(?=[\[a-z\d])' % length), lambda match: ''))
                end_rules.append(
                    # Unwrap using punctuation
                    (re.compile((
                        r'(?<=.{%i}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\\IAß]'
                        r'|(?<!\&\w{4});))\s*(?P<ital></(i|b|u)>)?\s*(</p>\s*<p>\s*)+\s*(?=(<(i|b|u)>)?'
                        r'\s*[\w\d$(])') % length, re.UNICODE), wrap_lines),
                )
        for rule in html_preprocess_rules() + start_rules:
            html = rule[0].sub(rule[1], html)
        if self.regex_wizard_callback is not None:
            self.regex_wizard_callback(self.current_href, html)
        if get_preprocess_html:
            return html
        def dump(raw, where):
            import os
            dp = getattr(self.extra_opts, 'debug_pipeline', None)
            if dp and os.path.exists(dp):
                odir = os.path.join(dp, 'input')
                if os.path.exists(odir):
                    odir = os.path.join(odir, where)
                    if not os.path.exists(odir):
                        os.makedirs(odir)
                    name, i = None, 0
                    while not name or os.path.exists(os.path.join(odir, name)):
                        i += 1
                        name = '%04d.html'%i
                    with open(os.path.join(odir, name), 'wb') as f:
                        f.write(raw.encode('utf-8'))
        # dump(html, 'pre-preprocess')
        for rule in rules + end_rules:
            try:
                html = rule[0].sub(rule[1], html)
            except Exception as e:
                if rule in user_sr_rules:
                    self.log.error(
                        'User supplied search & replace rule: %s -> %s '
                        'failed with error: %s, ignoring.'%(
                            user_sr_rules[rule], rule[1], e))
                else:
                    raise
        if is_pdftohtml and length > -1:
            # Dehyphenate
            dehyphenator = Dehyphenator(self.extra_opts.verbose, self.log)
            html = dehyphenator(html,'html', length)
        if is_pdftohtml:
            from calibre.ebooks.conversion.utils import HeuristicProcessor
            pdf_markup = HeuristicProcessor(self.extra_opts, None)
            totalwords = 0
            if pdf_markup.get_word_count(html) > 7000:
                html = pdf_markup.markup_chapters(html, totalwords, True)
        # dump(html, 'post-preprocess')
        # Handle broken XHTML w/ SVG (ugh)
        if 'svg:' in html and SVG_NS not in html:
            html = html.replace(
                '<html', '<html xmlns:svg="%s"' % SVG_NS, 1)
        if 'xlink:' in html and XLINK_NS not in html:
            html = html.replace(
                '<html', '<html xmlns:xlink="%s"' % XLINK_NS, 1)
        html = XMLDECL_RE.sub('', html)
        if getattr(self.extra_opts, 'asciiize', False):
            from calibre.utils.localization import get_udc
            from calibre.utils.mreplace import MReplace
            unihandecoder = get_udc()
            mr = MReplace(data={'«':'&lt;'*3, '»':'&gt;'*3})
            html = mr.mreplace(html)
            html = unihandecoder.decode(html)
        if getattr(self.extra_opts, 'enable_heuristics', False):
            from calibre.ebooks.conversion.utils import HeuristicProcessor
            preprocessor = HeuristicProcessor(self.extra_opts, self.log)
            html = preprocessor(html)
        if is_pdftohtml:
            html = html.replace('<!-- created by calibre\'s pdftohtml -->', '')
        if getattr(self.extra_opts, 'smarten_punctuation', False):
            html = smarten_punctuation(html, self.log)
        try:
            unsupported_unicode_chars = self.extra_opts.output_profile.unsupported_unicode_chars
        except AttributeError:
            unsupported_unicode_chars = ''
        if unsupported_unicode_chars:
            from calibre.utils.localization import get_udc
            unihandecoder = get_udc()
            for char in unsupported_unicode_chars:
                asciichar = unihandecoder.decode(char)
                html = html.replace(char, asciichar)
        return html
--- a/ebook_converter/ebooks/conversion/utils.py
+++ b/ebook_converter/ebooks/conversion/utils.py
@@ -0,0 +1,881 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2010, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import re
 from math import ceil
 from calibre.ebooks.conversion.preprocess import DocAnalysis, Dehyphenator
 from calibre.utils.logging import default_log
 from calibre.utils.wordcount import get_wordcount_obj
 from polyglot.builtins import unicode_type
 class HeuristicProcessor(object):
    def __init__(self, extra_opts=None, log=None):
        self.log = default_log if log is None else log
        self.html_preprocess_sections = 0
        self.found_indents = 0
        self.extra_opts = extra_opts
        self.deleted_nbsps = False
        self.totalwords = 0
        self.min_chapters = 1
        self.chapters_no_title = 0
        self.chapters_with_title = 0
        self.blanks_deleted = False
        self.blanks_between_paragraphs = False
        self.linereg = re.compile('(?<=<p).*?(?=</p>)', re.IGNORECASE|re.DOTALL)
        self.blankreg = re.compile(r'\s*(?P<openline><p(?!\sclass=\"(softbreak|whitespace)\")[^>]*>)\s*(?P<closeline></p>)', re.IGNORECASE)
        self.anyblank = re.compile(r'\s*(?P<openline><p[^>]*>)\s*(?P<closeline></p>)', re.IGNORECASE)
        self.multi_blank = re.compile(r'(\s*<p[^>]*>\s*</p>(\s*<div[^>]*>\s*</div>\s*)*){2,}(?!\s*<h\d)', re.IGNORECASE)
        self.any_multi_blank = re.compile(r'(\s*<p[^>]*>\s*</p>(\s*<div[^>]*>\s*</div>\s*)*){2,}', re.IGNORECASE)
        self.line_open = (
            r"<(?P<outer>p|div)[^>]*>\s*(<(?P<inner1>font|span|[ibu])[^>]*>)?\s*"
            r"(<(?P<inner2>font|span|[ibu])[^>]*>)?\s*(<(?P<inner3>font|span|[ibu])[^>]*>)?\s*")
        self.line_close = "(</(?P=inner3)>)?\\s*(</(?P=inner2)>)?\\s*(</(?P=inner1)>)?\\s*</(?P=outer)>"
        self.single_blank = re.compile(r'(\s*<(p|div)[^>]*>\s*</(p|div)>)', re.IGNORECASE)
        self.scene_break_open = '<p class="scenebreak" style="text-align:center; text-indent:0%; margin-top:1em; margin-bottom:1em; page-break-before:avoid">'
        self.common_in_text_endings = '[\"\'—’”,\\.!\\?\\…\\)„\\w]'
        self.common_in_text_beginnings = '[\\w\'\"“‘‛]'
    def is_pdftohtml(self, src):
        return '<!-- created by calibre\'s pdftohtml -->' in src[:1000]
    def is_abbyy(self, src):
        return '<meta name="generator" content="ABBYY FineReader' in src[:1000]
    def chapter_head(self, match):
        from calibre.utils.html2text import html2text
        chap = match.group('chap')
        title = match.group('title')
        if not title:
            self.html_preprocess_sections = self.html_preprocess_sections + 1
            self.log.debug("marked " + unicode_type(self.html_preprocess_sections) +
                    " chapters. - " + unicode_type(chap))
            return '<h2>'+chap+'</h2>\n'
        else:
            delete_whitespace = re.compile('^\\s*(?P<c>.*?)\\s*$')
            delete_quotes = re.compile('[\'\"]')
            txt_chap = delete_quotes.sub('', delete_whitespace.sub('\\g<c>', html2text(chap)))
            txt_title = delete_quotes.sub('', delete_whitespace.sub('\\g<c>', html2text(title)))
            self.html_preprocess_sections = self.html_preprocess_sections + 1
            self.log.debug("marked " + unicode_type(self.html_preprocess_sections) +
                    " chapters & titles. - " + unicode_type(chap) + ", " + unicode_type(title))
            return '<h2 title="'+txt_chap+', '+txt_title+'">'+chap+'</h2>\n<h3 class="sigilNotInTOC">'+title+'</h3>\n'
    def chapter_break(self, match):
        chap = match.group('section')
        styles = match.group('styles')
        self.html_preprocess_sections = self.html_preprocess_sections + 1
        self.log.debug("marked " + unicode_type(self.html_preprocess_sections) +
                " section markers based on punctuation. - " + unicode_type(chap))
        return '<'+styles+' style="page-break-before:always">'+chap
    def analyze_title_matches(self, match):
        # chap = match.group('chap')
        title = match.group('title')
        if not title:
            self.chapters_no_title = self.chapters_no_title + 1
        else:
            self.chapters_with_title = self.chapters_with_title + 1
    def insert_indent(self, match):
        pstyle = match.group('formatting')
        tag = match.group('tagtype')
        span = match.group('span')
        self.found_indents = self.found_indents + 1
        if pstyle:
            if pstyle.lower().find('style') != -1:
                pstyle = re.sub(r'"$', '; text-indent:3%"', pstyle)
            else:
                pstyle = pstyle+' style="text-indent:3%"'
            if not span:
                return '<'+tag+' '+pstyle+'>'
            else:
                return '<'+tag+' '+pstyle+'>'+span
        else:
            if not span:
                return '<'+tag+' style="text-indent:3%">'
            else:
                return '<'+tag+' style="text-indent:3%">'+span
    def no_markup(self, raw, percent):
        '''
        Checks how many line endings in the file are marked up. raw is the
        text to inspect. percent is the minimum fraction of line endings that
        should be marked up; returns True when fewer than that are marked up.
        '''
        htm_end_ere = re.compile('</(p|div)>', re.DOTALL)
        line_end_ere = re.compile('(\n|\r|\r\n)', re.DOTALL)
        htm_end = htm_end_ere.findall(raw)
        line_end = line_end_ere.findall(raw)
        tot_htm_ends = len(htm_end)
        tot_ln_fds = len(line_end)
        # self.log.debug("There are " + unicode_type(tot_ln_fds) + " total Line feeds, and " +
        #        unicode_type(tot_htm_ends) + " marked up endings")
        if percent > 1:
            percent = 1
        if percent < 0:
            percent = 0
        min_lns = tot_ln_fds * percent
        # self.log.debug("There must be fewer than " + unicode_type(min_lns) + " unmarked lines to add markup")
        return min_lns > tot_htm_ends
    def dump(self, raw, where):
        import os
        dp = getattr(self.extra_opts, 'debug_pipeline', None)
        if dp and os.path.exists(dp):
            odir = os.path.join(dp, 'preprocess')
            if not os.path.exists(odir):
                os.makedirs(odir)
            if os.path.exists(odir):
                odir = os.path.join(odir, where)
                if not os.path.exists(odir):
                    os.makedirs(odir)
                name, i = None, 0
                while not name or os.path.exists(os.path.join(odir, name)):
                    i += 1
                    name = '%04d.html'%i
                with open(os.path.join(odir, name), 'wb') as f:
                    f.write(raw.encode('utf-8'))
    def get_word_count(self, html):
        word_count_text = re.sub(r'(?s)<head[^>]*>.*?</head>', '', html)
        word_count_text = re.sub(r'<[^>]*>', '', word_count_text)
        wordcount = get_wordcount_obj(word_count_text)
        return wordcount.words
    def markup_italicis(self, html):
        # self.log.debug("\n\n\nitalicize debugging \n\n\n")
        ITALICIZE_WORDS = [
            'Etc.', 'etc.', 'viz.', 'ie.', 'i.e.', 'Ie.', 'I.e.', 'eg.',
            'e.g.', 'Eg.', 'E.g.', 'et al.', 'et cetera', 'n.b.', 'N.b.',
            'nota bene', 'Nota bene', 'Ste.', 'Mme.', 'Mdme.',
            'Mlle.', 'Mons.', 'PS.', 'PPS.',
        ]
        ITALICIZE_STYLE_PATS = [
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])_\*/(?P<words>[^\*_]+)/\*_'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])~~(?P<words>[^~]+)~~'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])_/(?P<words>[^/_]+)/_'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])_\*(?P<words>[^\*_]+)\*_'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])\*/(?P<words>[^/\*]+)/\*'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])/:(?P<words>[^:/]+):/'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])\|:(?P<words>[^:\|]+):\|'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])\*(?P<words>[^\*]+)\*'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])~(?P<words>[^~]+)~'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])/(?P<words>[^/\*><]+)/'),
            unicode_type(r'(?msu)(?<=[\s>"“\'‘])_(?P<words>[^_]+)_'),
        ]
        for word in ITALICIZE_WORDS:
            html = re.sub(r'(?<=\s|>)' + re.escape(word) + r'(?=\s|<)', '<i>%s</i>' % word, html)
        search_text = re.sub(r'(?s)<head[^>]*>.*?</head>', '', html)
        search_text = re.sub(r'<[^>]*>', '', search_text)
        for pat in ITALICIZE_STYLE_PATS:
            for match in re.finditer(pat, search_text):
                ital_string = unicode_type(match.group('words'))
                # self.log.debug("italicising "+unicode_type(match.group(0))+"    with <i>"+ital_string+"</i>")
                try:
                    html = re.sub(re.escape(unicode_type(match.group(0))), '<i>%s</i>' % ital_string, html)
                except OverflowError:
                    # match.group(0) was too large to be compiled into a regex
                    continue
                except re.error:
                    # the match was not a valid regular expression
                    continue
        return html
    def markup_chapters(self, html, wordcount, blanks_between_paragraphs):
        '''
        Searches for common chapter headings throughout the document.
        Attempts multiple patterns, ordered by likelihood of a match
        with minimum false positives.  Exits after finding a successful pattern.
        '''
        # Typical chapters are between 2000 and 7000 words, use the larger number to decide the
        # minimum of chapters to search for.  A max limit is calculated to prevent things like OCR
        # or pdf page numbers from being treated as TOC markers
        max_chapters = 150
        typical_chapters = 7000.
        if wordcount > 7000:
            if wordcount > 200000:
                typical_chapters = 15000.
            self.min_chapters = int(ceil(wordcount / typical_chapters))
        self.log.debug("minimum chapters required are: "+unicode_type(self.min_chapters))
        heading = re.compile('<h[1-3][^>]*>', re.IGNORECASE)
        self.html_preprocess_sections = len(heading.findall(html))
        self.log.debug("found " + unicode_type(self.html_preprocess_sections) + " pre-existing headings")
        # Build the Regular Expressions in pieces
        init_lookahead = "(?=<(p|div))"
        chapter_line_open = self.line_open
        title_line_open = (r"<(?P<outer2>p|div)[^>]*>\s*(<(?P<inner4>font|span|[ibu])[^>]*>)?"
        r"\s*(<(?P<inner5>font|span|[ibu])[^>]*>)?\s*(<(?P<inner6>font|span|[ibu])[^>]*>)?\s*")
        chapter_header_open = r"(?P<chap>"
        title_header_open = r"(?P<title>"
        chapter_header_close = ")\\s*"
        title_header_close = ")"
        chapter_line_close = self.line_close
        title_line_close = "(</(?P=inner6)>)?\\s*(</(?P=inner5)>)?\\s*(</(?P=inner4)>)?\\s*</(?P=outer2)>"
        is_pdftohtml = self.is_pdftohtml(html)
        if is_pdftohtml:
            title_line_open = "<(?P<outer2>p)[^>]*>\\s*"
            title_line_close = "\\s*</(?P=outer2)>"
        if blanks_between_paragraphs:
            blank_lines = "(\\s*<p[^>]*>\\s*</p>){0,2}\\s*"
        else:
            blank_lines = ""
        opt_title_open = "("
        opt_title_close = ")?"
        n_lookahead_open = "(?!\\s*"
        n_lookahead_close = ")\\s*"
        default_title = r"(<[ibu][^>]*>)?\s{0,3}(?!Chapter)([\w\:\'’\"-]+\s{0,3}){1,5}?(</[ibu][^>]*>)?(?=<)"
        simple_title = r"(<[ibu][^>]*>)?\s{0,3}(?!(Chapter|\s+<)).{0,65}?(</[ibu][^>]*>)?(?=<)"
        analysis_result = []
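        # Each entry is [pattern, n_lookahead_req, strict_title, ignorecase, title_req, log_message, type_name],
        # ordered from the most typical heading style to the most aggressive
        chapter_types = [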
            [(
                r"[^'\"]?(Introduction|Synopsis|Acknowledgements|Epilogue|CHAPTER|Kapitel|Volume\b|Prologue|Book\b|Part\b|Dedication|Preface)"
                r"\s*([\d\w-]+\:?\'?\s*){0,5}"), True, True, True, False, "Searching for common section headings", 'common'],
            # Highest frequency headings which include titles
            [r"[^'\"]?(CHAPTER|Kapitel)\s*([\dA-Z\-\'\"\?!#,]+\s*){0,7}\s*", True, True, True, False, "Searching for most common chapter headings", 'chapter'],
            [r"<b[^>]*>\s*(<span[^>]*>)?\s*(?!([*#•=]+\s*)+)(\s*(?=[\d.\w#\-*\s]+<)([\d.\w#-*]+\s*){1,5}\s*)(?!\.)(</span>)?\s*</b>",
                           True, True, True, False, "Searching for emphasized lines", 'emphasized'],  # Emphasized lines
            [r"[^'\"]?(\d+(\.|:))\s*([\w\-\'\"#,]+\s*){0,7}\s*", True, True, True, False,
                       "Searching for numeric chapter headings", 'numeric'],  # Numeric Chapters
            [r"([A-Z]\s+){3,}\s*([\d\w-]+\s*){0,3}\s*", True, True, True, False, "Searching for letter spaced headings", 'letter_spaced'],  # Spaced Lettering
            [r"[^'\"]?(\d+\.?\s+([\d\w-]+\:?\'?-?\s?){0,5})\s*", True, True, True, False,
                       "Searching for numeric chapters with titles", 'numeric_title'],  # Numeric Titles
            [r"[^'\"]?(\d+)\s*([\dA-Z\-\'\"\?!#,]+\s*){0,7}\s*", True, True, True, False,
                       "Searching for simple numeric headings", 'plain_number'],  # Numeric Chapters, no dot or colon
            [r"\s*[^'\"]?([A-Z#]+(\s|-){0,3}){1,5}\s*", False, True, False, False,
                          "Searching for chapters with Uppercase Characters", 'uppercase']  # Uppercase Chapters
            ]
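        def recurse_patterns(html, analyze):
            # Despite the name this iterates rather than recurses.  With analyze=True it only
            # records which patterns match; with analyze=False it applies them to the html.
            # Start with most typical chapter headings, get more aggressive until one works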
            for [chapter_type, n_lookahead_req, strict_title, ignorecase, title_req, log_message, type_name] in chapter_types:
                n_lookahead = ''
                hits = 0
                self.chapters_no_title = 0
                self.chapters_with_title = 0
                if n_lookahead_req:
                    lp_n_lookahead_open = n_lookahead_open
                    lp_n_lookahead_close = n_lookahead_close
                else:
                    lp_n_lookahead_open = ''
                    lp_n_lookahead_close = ''
                if strict_title:
                    lp_title = default_title
                else:
                    lp_title = simple_title
                if ignorecase:
                    arg_ignorecase = r'(?i)'
                else:
                    arg_ignorecase = ''
                if title_req:
                    lp_opt_title_open = ''
                    lp_opt_title_close = ''
                else:
                    lp_opt_title_open = opt_title_open
                    lp_opt_title_close = opt_title_close
                if self.html_preprocess_sections >= self.min_chapters:
                    break
                full_chapter_line = chapter_line_open+chapter_header_open+chapter_type+chapter_header_close+chapter_line_close
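                if n_lookahead_req:
                    # Rename the named groups (for example outer and chap) so the chapter line can be
                    # repeated inside the negative lookahead without duplicate group names.  The
                    # substitution is purely textual and applies to every 'ou', 'in' or 'cha' in the pattern
                    n_lookahead = re.sub("(ou|in|cha)", "lookahead_", full_chapter_line)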
                if not analyze:
                    self.log.debug("Marked " + unicode_type(self.html_preprocess_sections) + " headings, " + log_message)
                chapter_marker = arg_ignorecase+init_lookahead+full_chapter_line+blank_lines+lp_n_lookahead_open+n_lookahead+lp_n_lookahead_close+ \
                    lp_opt_title_open+title_line_open+title_header_open+lp_title+title_header_close+title_line_close+lp_opt_title_close
                chapdetect = re.compile(r'%s' % chapter_marker)
                if analyze:
                    hits = len(chapdetect.findall(html))
                    if hits:
                        # The substitution result is discarded; the callback is expected to tally
                        # self.chapters_with_title and self.chapters_no_title
                        chapdetect.sub(self.analyze_title_matches, html)
                        title_ratio = float(self.chapters_with_title) / float(hits)
                        if title_ratio > .5:
                            title_req = True
                            strict_title = False
                        self.log.debug(
                                unicode_type(type_name)+" had "+unicode_type(hits)+
                                " hits - "+unicode_type(self.chapters_no_title)+" chapters with no title, "+
                                unicode_type(self.chapters_with_title)+" chapters with titles, "+
                                unicode_type(title_ratio * 100)+" percent. ")
                        if type_name == 'common':
                            analysis_result.append([chapter_type, n_lookahead_req, strict_title, ignorecase, title_req, log_message, type_name])
                        elif self.min_chapters <= hits < max_chapters or (self.min_chapters < 3 and hits < 3):
                            analysis_result.append([chapter_type, n_lookahead_req, strict_title, ignorecase, title_req, log_message, type_name])
                            break
                else:
                    html = chapdetect.sub(self.chapter_head, html)
            return html
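        # First pass: analyze which patterns produce a plausible number of hits
        recurse_patterns(html, True)
        # Second pass: apply only the patterns that the analysis kept
        chapter_types = analysis_result
        html = recurse_patterns(html, False)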
        words_per_chptr = wordcount
        if words_per_chptr > 0 and self.html_preprocess_sections > 0:
            words_per_chptr = wordcount // self.html_preprocess_sections
        self.log.debug("Total wordcount is: "+ unicode_type(wordcount)+", Average words per section is: "+
                       unicode_type(words_per_chptr)+", Marked up "+unicode_type(self.html_preprocess_sections)+" chapters")
        return html
    def punctuation_unwrap(self, length, content, format):
        '''
        Unwraps lines based on line length and punctuation.
        Supports a range of html markup and text files.
        The regex named lookahead below (technically a lookbehind) is meant to look for any
        non-full stop characters.  Punctuation characters which can be used as a full stop
        should *not* be added below - e.g. ?!“”. etc.  The reason for this is to prevent
        false positive unwrapping.  False positives are more difficult to detect than false
        negatives during a manual review of the doc.
        This function intentionally leaves hyphenated content alone as that is handled by the
        dehyphenate routine in a separate step.
        '''
        def style_unwrap(match):
            style_close = match.group('style_close')
            style_open = match.group('style_open')
            if style_open and style_close:
                return style_close+' '+style_open
            elif style_open and not style_close:
                return ' '+style_open
            elif not style_open and style_close:
                return style_close+' '
            else:
                return ' '
        # define the pieces of the regex
        # (?<!\&\w{4});) is a semicolon not part of an entity
        lookahead = "(?<=.{"+unicode_type(length)+r"}([a-zა-ჰäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\\IAß]|(?<!\&\w{4});))"
        em_en_lookahead = "(?<=.{"+unicode_type(length)+"}[\u2013\u2014])"
        soft_hyphen = "\xad"
        line_ending = "\\s*(?P<style_close></(span|[iub])>)?\\s*(</(p|div)>)?"
        blanklines = "\\s*(?P<up2threeblanks><(p|span|div)[^>]*>\\s*(<(p|span|div)[^>]*>\\s*</(span|p|div)>\\s*)</(span|p|div)>\\s*){0,3}\\s*"
        line_opening = "<(p|div)[^>]*>\\s*(?P<style_open><(span|[iub])[^>]*>)?\\s*"
        txt_line_wrap = "((\u0020|\u0009)*\n){1,4}"
        if format == 'txt':
            unwrap_regex = lookahead+txt_line_wrap
            em_en_unwrap_regex = em_en_lookahead+txt_line_wrap
            shy_unwrap_regex = soft_hyphen+txt_line_wrap
        else:
            unwrap_regex = lookahead+line_ending+blanklines+line_opening
            em_en_unwrap_regex = em_en_lookahead+line_ending+blanklines+line_opening
            shy_unwrap_regex = soft_hyphen+line_ending+blanklines+line_opening
        unwrap = re.compile("%s" % unwrap_regex, re.UNICODE)
        em_en_unwrap = re.compile("%s" % em_en_unwrap_regex, re.UNICODE)
        shy_unwrap = re.compile("%s" % shy_unwrap_regex, re.UNICODE)
        if format == 'txt':
            content = unwrap.sub(' ', content)
            content = em_en_unwrap.sub('', content)
            content = shy_unwrap.sub('', content)
        else:
            content = unwrap.sub(style_unwrap, content)
            content = em_en_unwrap.sub(style_unwrap, content)
            content = shy_unwrap.sub(style_unwrap, content)
        return content
    def txt_process(self, match):
        from calibre.ebooks.txt.processor import convert_basic, separate_paragraphs_single_line
        content = match.group('text')
        content = separate_paragraphs_single_line(content)
        content = convert_basic(content, epub_split_size_kb=0)
        return content
    def markup_pre(self, html):
        pre = re.compile(r'<pre>', re.IGNORECASE)
        if len(pre.findall(html)) >= 1:
            self.log.debug("Running Text Processing")
            outerhtml = re.compile(r'.*?(?<=<pre>)(?P<text>.*?)</pre>', re.IGNORECASE|re.DOTALL)
            html = outerhtml.sub(self.txt_process, html)
            from calibre.ebooks.conversion.preprocess import convert_entities
            html = re.sub(r'&(\S+?);', convert_entities, html)
        else:
            # Add markup naively
            # TODO - find out if there are cases where there is more than one <pre> tag or
            # other types of unmarked html and handle them in some better fashion
            add_markup = re.compile('(?<!>)(\n)')
            html = add_markup.sub('</p>\n<p>', html)
        return html
    def arrange_htm_line_endings(self, html):
        html = re.sub(r"\s*</(?P<tag>p|div)>", "</"+"\\g<tag>"+">\n", html)
        html = re.sub(r"\s*<(?P<tag>p|div)(?P<style>[^>]*)>\s*", "\n<"+"\\g<tag>"+"\\g<style>"+">", html)
        return html
    def fix_nbsp_indents(self, html):
        txtindent = re.compile(unicode_type(r'<(?P<tagtype>p|div)(?P<formatting>[^>]*)>\s*(?P<span>(<span[^>]*>\s*)+)?\s*(\u00a0){2,}'), re.IGNORECASE)
        html = txtindent.sub(self.insert_indent, html)
        if self.found_indents > 1:
            self.log.debug("replaced "+unicode_type(self.found_indents)+ " nbsp indents with inline styles")
        return html
    def cleanup_markup(self, html):
        # remove remaining non-breaking spaces
        html = re.sub(unicode_type(r'\u00a0'), ' ', html)
        # Get rid of various common microsoft specific tags which can cause issues later
        # Get rid of empty <o:p> tags to simplify other processing
        html = re.sub(unicode_type(r'\s*<o:p>\s*</o:p>'), ' ', html)
        # Delete microsoft 'smart' tags
        html = re.sub('(?i)</?st1:\\w+>', '', html)
        # Re-open self closing paragraph tags
        html = re.sub('<p[^>/]*/>', '<p> </p>', html)
        # Get rid of empty span, bold, font, em, & italics tags
        fmt_tags = 'font|[ibu]|em|strong'
        open_fmt_pat, close_fmt_pat = r'<(?:{})(?:\s[^>]*)?>'.format(fmt_tags), '</(?:{})>'.format(fmt_tags)
        for i in range(2):
            html = re.sub(r"\s*<span[^>]*>\s*(<span[^>]*>\s*</span>){0,2}\s*</span>\s*", " ", html)
            html = re.sub(
                r"\s*{open}\s*({open}\s*{close}\s*){{0,2}}\s*{close}".format(open=open_fmt_pat, close=close_fmt_pat) , " ", html)
        # delete surrounding divs from empty paragraphs
        html = re.sub('<div[^>]*>\\s*<p[^>]*>\\s*</p>\\s*</div>', '<p> </p>', html)
        # Empty heading tags
        html = re.sub(r'(?i)<h\d+>\s*</h\d+>', '', html)
        self.deleted_nbsps = True
        return html
    def analyze_line_endings(self, html):
        '''
        Determines the type of html line ending used most commonly in a document.
        Use before calling docanalysis functions.
        '''
        paras_reg = re.compile('<p[^>]*>', re.IGNORECASE)
        spans_reg = re.compile('<span[^>]*>', re.IGNORECASE)
        paras = len(paras_reg.findall(html))
        spans = len(spans_reg.findall(html))
        if spans > 1 and float(paras) / float(spans) < 0.75:
            return 'spanned_html'
        return 'html'
    def analyze_blanks(self, html):
        blanklines = self.blankreg.findall(html)
        lines = self.linereg.findall(html)
        if len(lines) > 1:
            blank_ratio = float(len(blanklines)) / float(len(lines))
            self.log.debug("There are " + unicode_type(len(blanklines)) + " blank lines. " +
                    unicode_type(blank_ratio * 100) + " percent blank")
            return blank_ratio > 0.40
        # Too few lines to judge, treat the document as not using interleaved blanks
        return False
    def cleanup_required(self):
        for option in ['unwrap_lines', 'markup_chapter_headings', 'format_scene_breaks', 'delete_blank_paragraphs']:
            if getattr(self.extra_opts, option, False):
                return True
        return False
    def merge_blanks(self, html, blanks_count=None):
        base_em = .5  # Baseline is 1.5em per blank line, 1st line is .5 em css and 1em for the nbsp
        em_per_line = 1.5  # Add another 1.5 em for each additional blank
        def merge_matches(match):
            to_merge = match.group(0)
            lines = float(len(self.single_blank.findall(to_merge))) - 1.
            em = base_em + (em_per_line * lines)
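            # Worked example, assuming single_blank matches each blank paragraph once: three
            # merged blanks give lines = 2.0, so em = .5 + 1.5 * 2 = 3.5 and the class suffix
            # is int(3.5 * 10) = 35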
            if 'whitespace' in to_merge:
                newline = self.any_multi_blank.sub('\n<p class="whitespace'+unicode_type(int(em * 10))+
                                                   '" style="text-align:center; margin-top:'+unicode_type(em)+'em"> </p>', match.group(0))
            else:
                newline = self.any_multi_blank.sub('\n<p class="softbreak'+unicode_type(int(em * 10))+
                                                   '" style="text-align:center; margin-top:'+unicode_type(em)+'em"> </p>', match.group(0))
            return newline
        html = self.any_multi_blank.sub(merge_matches, html)
        return html
    def detect_whitespace(self, html):
        blanks_around_headings = re.compile(
            r'(?P<initparas>(<(p|div)[^>]*>\s*</(p|div)>\s*){1,}\s*)?'
            r'(?P<content><h(?P<hnum>\d+)[^>]*>.*?</h(?P=hnum)>)(?P<endparas>\s*(<(p|div)[^>]*>\s*</(p|div)>\s*){1,})?', re.IGNORECASE|re.DOTALL)
        blanks_around_scene_breaks = re.compile(
            r'(?P<initparas>(<(p|div)[^>]*>\s*</(p|div)>\s*){1,}\s*)?'
            r'(?P<content><p class="scenebreak"[^>]*>.*?</p>)(?P<endparas>\s*(<(p|div)[^>]*>\s*</(p|div)>\s*){1,})?', re.IGNORECASE|re.DOTALL)
        blanks_n_nopunct = re.compile(
            r'(?P<initparas>(<p[^>]*>\s*</p>\s*){1,}\s*)?<p[^>]*>\s*(<(span|[ibu]|em|strong|font)[^>]*>\s*)*'
            r'.{1,100}?[^\W](</(span|[ibu]|em|strong|font)>\s*)*</p>(?P<endparas>\s*(<p[^>]*>\s*</p>\s*){1,})?', re.IGNORECASE|re.DOTALL)
        def merge_header_whitespace(match):
            initblanks = match.group('initparas')
            endblanks = match.group('endparas')
            content = match.group('content')
            top_margin = ''
            bottom_margin = ''
            if initblanks is not None:
                top_margin = 'margin-top:'+unicode_type(len(self.single_blank.findall(initblanks)))+'em;'
            if endblanks is not None:
                bottom_margin = 'margin-bottom:'+unicode_type(len(self.single_blank.findall(endblanks)))+'em;'
            if initblanks is None and endblanks is None:
                return content
            elif content.find('scenebreak') != -1:
                return content
            else:
                content = re.sub('(?i)<h(?P<hnum>\\d+)[^>]*>', '\n\n<h'+'\\g<hnum>'+' style="'+top_margin+bottom_margin+'">', content)
            return content
        html = blanks_around_headings.sub(merge_header_whitespace, html)
        html = blanks_around_scene_breaks.sub(merge_header_whitespace, html)
        def markup_whitespaces(match):
            blanks = match.group(0)
            blanks = self.blankreg.sub('\n<p class="whitespace" style="text-align:center; margin-top:0em; margin-bottom:0em"> </p>', blanks)
            return blanks
        html = blanks_n_nopunct.sub(markup_whitespaces, html)
        if self.html_preprocess_sections > self.min_chapters:
            html = re.sub('(?si)^.*?(?=<h\\d)', markup_whitespaces, html)
        return html
    def detect_soft_breaks(self, html):
        line = '(?P<initline>'+self.line_open+'\\s*(?P<init_content>.*?)'+self.line_close+')'
        line_two = '(?P<line_two>'+re.sub('(ou|in|cha)', 'linetwo_', self.line_open)+ \
                     '\\s*(?P<line_two_content>.*?)'+re.sub('(ou|in|cha)', 'linetwo_', self.line_close)+')'
        div_break_candidate_pattern = line+'\\s*<div[^>]*>\\s*</div>\\s*'+line_two
        div_break_candidate = re.compile(r'%s' % div_break_candidate_pattern, re.IGNORECASE|re.UNICODE)
        def convert_div_softbreaks(match):
            init_is_paragraph = self.check_paragraph(match.group('init_content'))
            line_two_is_paragraph = self.check_paragraph(match.group('line_two_content'))
            if init_is_paragraph and line_two_is_paragraph:
                return (match.group('initline')+
                        '\n<p class="softbreak" style="margin-top:.5em; page-break-before:avoid; text-align:center"> </p>\n'+
                        match.group('line_two'))
            else:
                return match.group(0)
        html = div_break_candidate.sub(convert_div_softbreaks, html)
        if not self.blanks_deleted and self.blanks_between_paragraphs:
            html = self.multi_blank.sub('\n<p class="softbreak" style="margin-top:1em; page-break-before:avoid; text-align:center"> </p>', html)
        else:
            html = self.blankreg.sub('\n<p class="softbreak" style="margin-top:.5em; page-break-before:avoid; text-align:center"> </p>', html)
        return html
    def detect_scene_breaks(self, html):
        scene_break_regex = self.line_open+'(?!('+self.common_in_text_beginnings+'|.*?'+self.common_in_text_endings+ \
                                             '<))(?P<break>((?P<break_char>((?!\\s)\\W))\\s*(?P=break_char)?)+)\\s*'+self.line_close
        scene_breaks = re.compile(r'%s' % scene_break_regex, re.IGNORECASE|re.UNICODE)
        html = scene_breaks.sub(self.scene_break_open+'\\g<break>'+'</p>', html)
        return html
    def markup_user_break(self, replacement_break):
        '''
        Takes a string the user supplies and wraps it in markup that will be centered with
        appropriate margins.  <hr> and <img> tags are allowed.  If the user specifies
        a style with width attributes in the <hr> tag then the appropriate margins are
        applied to wrapping divs.  This is because many ebook devices don't support margin:auto.
        All other html is converted to text.
        '''
        hr_open = '<div id="scenebreak" style="margin-left: 45%; margin-right: 45%; margin-top:1.5em; margin-bottom:1.5em; page-break-before:avoid">'
        if re.findall('(<|>)', replacement_break):
            if re.match('^<hr', replacement_break):
                if replacement_break.find('width') != -1:
                    try:
                        width = int(re.sub('.*?width(:|=)(?P<wnum>\\d+).*', '\\g<wnum>', replacement_break))
                    except ValueError:
                        scene_break = hr_open+'<hr style="height: 3px; background:#505050" /></div>'
                        self.log.warn('Invalid replacement scene break'
                                ' expression, using default')
                    else:
                        replacement_break = re.sub('(?i)(width=\\d+\\%?|width:\\s*\\d+(\\%|px|pt|em)?;?)', '', replacement_break)
                        divpercent = (100 - width) // 2
                        hr_open = re.sub('45', unicode_type(divpercent), hr_open)
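                        # Worked example: <hr style="width:30%" /> yields width = 30, so
                        # divpercent = (100 - 30) // 2 = 35 and both side margins become 35%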
                        scene_break = hr_open+replacement_break+'</div>'
                else:
                    scene_break = hr_open+'<hr style="height: 3px; background:#505050" /></div>'
            elif re.match('^<img', replacement_break):
                scene_break = self.scene_break_open+replacement_break+'</p>'
            else:
                from calibre.utils.html2text import html2text
                replacement_break = html2text(replacement_break)
                replacement_break = re.sub('\\s', '&nbsp;', replacement_break)
                scene_break = self.scene_break_open+replacement_break+'</p>'
        else:
            replacement_break = re.sub('\\s', '&nbsp;', replacement_break)
            scene_break = self.scene_break_open+replacement_break+'</p>'
        return scene_break
    def check_paragraph(self, content):
        # A line counts as a paragraph if, ignoring span tags, it ends with a quote mark or one of . ! ? :
        content = re.sub('\\s*</?span[^>]*>\\s*', '', content)
        return bool(re.match('.*[\"\'.!?:]$', content))
    def abbyy_processor(self, html):
        abbyy_line = re.compile('((?P<linestart><p\\sstyle="(?P<styles>[^\"]*?);?">)(?P<content>.*?)(?P<lineend></p>)|(?P<image><img[^>]*>))', re.IGNORECASE)
        empty_paragraph = '\n<p> </p>\n'
        self.in_blockquote = False
        self.previous_was_paragraph = False
        html = re.sub('</?a[^>]*>', '', html)
        def convert_styles(match):
            # print "raw styles are: "+match.group('styles')
            content = match.group('content')
            # print "raw content is: "+match.group('content')
            image = match.group('image')
            is_paragraph = False
            text_align = ''
            text_indent = ''
            paragraph_before = ''
            paragraph_after = ''
            blockquote_open = '\n<blockquote>\n'
            blockquote_close = '</blockquote>\n'
            indented_text = 'text-indent:3%;'
            blockquote_open_loop = ''
            blockquote_close_loop = ''
            debugabby = False
            if image:
                debugabby = True
                if self.in_blockquote:
                    self.in_blockquote = False
                    blockquote_close_loop = blockquote_close
                self.previous_was_paragraph = False
                return blockquote_close_loop+'\n'+image+'\n'
            else:
                styles = match.group('styles').split(';')
                is_paragraph = self.check_paragraph(content)
                # print "styles for this line are: "+unicode_type(styles)
                split_styles = []
                for style in styles:
                    # Skip empty or malformed declarations so the unpacking below cannot fail
                    name, sep, value = style.partition(':')
                    if sep:
                        split_styles.append([name, value])
                styles = split_styles
                for style, setting in styles:
                    if style == 'text-align' and setting != 'left':
                        text_align = style+':'+setting+';'
                    if style == 'text-indent':
                        setting = int(re.sub('\\s*pt\\s*', '', setting))
                        if 9 < setting < 14:
                            text_indent = indented_text
                        else:
                            text_indent = style+':'+unicode_type(setting)+'pt;'
                    if style == 'padding':
                        setting = re.sub('pt', '', setting).split(' ')
                        if int(setting[1]) < 16 and int(setting[3]) < 16:
                            if self.in_blockquote:
                                debugabby = True
                                if is_paragraph:
                                    self.in_blockquote = False
                                    blockquote_close_loop = blockquote_close
                            if int(setting[3]) > 8 and text_indent == '':
                                text_indent = indented_text
                            if int(setting[0]) > 5:
                                paragraph_before = empty_paragraph
                            if int(setting[2]) > 5:
                                paragraph_after = empty_paragraph
                        elif not self.in_blockquote and self.previous_was_paragraph:
                            debugabby = True
                            self.in_blockquote = True
                            blockquote_open_loop = blockquote_open
                        if debugabby:
                            self.log.debug('\n\n******\n')
                            self.log.debug('padding top is: '+unicode_type(setting[0]))
                            self.log.debug('padding right is:' +unicode_type(setting[1]))
                            self.log.debug('padding bottom is: ' + unicode_type(setting[2]))
                            self.log.debug('padding left is: ' +unicode_type(setting[3]))
                # print "text-align is: "+unicode_type(text_align)
                # print "\n***\nline is:\n     "+unicode_type(match.group(0))+'\n'
                if debugabby:
                    # print "this line is a paragraph = "+unicode_type(is_paragraph)+", previous line was "+unicode_type(self.previous_was_paragraph)
                    self.log.debug("styles for this line were:", styles)
                    self.log.debug('newline is:')
                    self.log.debug(blockquote_open_loop+blockquote_close_loop+
                            paragraph_before+'<p style="'+text_indent+text_align+
                            '">'+content+'</p>'+paragraph_after+'\n\n\n\n\n')
                # print "is_paragraph is "+unicode_type(is_paragraph)+", previous_was_paragraph is "+unicode_type(self.previous_was_paragraph)
                self.previous_was_paragraph = is_paragraph
                # print "previous_was_paragraph is now set to "+unicode_type(self.previous_was_paragraph)+"\n\n\n"
                return blockquote_open_loop+blockquote_close_loop+paragraph_before+'<p style="'+text_indent+text_align+'">'+content+'</p>'+paragraph_after
        html = abbyy_line.sub(convert_styles, html)
        return html
    def __call__(self, html):
        self.log.debug("*********  Heuristic processing HTML  *********")
        # Count the words in the document to estimate how many chapters to look for and whether
        # other types of processing are attempted
        try:
            self.totalwords = self.get_word_count(html)
        except Exception:
            self.log.warn("Can't get wordcount")
        if self.totalwords < 50:
            self.log.warn("flow is too short, not running heuristics")
            return html
        is_abbyy = self.is_abbyy(html)
        if is_abbyy:
            html = self.abbyy_processor(html)
        # Arrange line feeds and </p> tags so the line_length and no_markup functions work correctly
        html = self.arrange_htm_line_endings(html)
        # self.dump(html, 'after_arrange_line_endings')
        if self.cleanup_required():
            # ##### Check Markup ######
            #
            # some lit files don't have any <p> tags or equivalent (generally just plain text between
            # <pre> tags), check and  mark up line endings if required before proceeding
            # fix indents must run after this step
            if self.no_markup(html, 0.1):
                self.log.debug("not enough paragraph markers, adding now")
                # markup using text processing
                html = self.markup_pre(html)
        # Replace series of non-breaking spaces with text-indent
        if getattr(self.extra_opts, 'fix_indents', False):
            html = self.fix_nbsp_indents(html)
        if self.cleanup_required():
            # fix indents must run before this step, as it removes non-breaking spaces
            html = self.cleanup_markup(html)
        is_pdftohtml = self.is_pdftohtml(html)
        if is_pdftohtml:
            self.line_open = "<(?P<outer>p)[^>]*>(\\s*<[ibu][^>]*>)?\\s*"
            self.line_close = "\\s*(</[ibu][^>]*>\\s*)?</(?P=outer)>"
        # ADE doesn't render <br />, change to empty paragraphs
        # html = re.sub('<br[^>]*>', u'<p>\u00a0</p>', html)
        # Determine whether the document uses interleaved blank lines
        self.blanks_between_paragraphs = self.analyze_blanks(html)
        # detect chapters/sections to match xpath or splitting logic
        if getattr(self.extra_opts, 'markup_chapter_headings', False):
            html = self.markup_chapters(html, self.totalwords, self.blanks_between_paragraphs)
        # self.dump(html, 'after_chapter_markup')
        if getattr(self.extra_opts, 'italicize_common_cases', False):
            html = self.markup_italicis(html)
        # If more than 40% of the lines are empty paragraphs and the user has enabled delete
        # blank paragraphs then delete blank lines to clean up spacing
        if self.blanks_between_paragraphs and getattr(self.extra_opts, 'delete_blank_paragraphs', False):
            self.log.debug("deleting blank lines")
            self.blanks_deleted = True
            html = self.multi_blank.sub('\n<p class="softbreak" style="margin-top:.5em; page-break-before:avoid; text-align:center"> </p>', html)
            html = self.blankreg.sub('', html)
        # Determine line ending type
        # Some OCR sourced files have line breaks in the html using a combination of span & p tags
        # span are used for hard line breaks, p for new paragraphs.  Determine which is used so
        # that lines can be un-wrapped across page boundaries
        format = self.analyze_line_endings(html)
        # Check the line histogram to determine if the document uses hard line breaks.  If 50% or
        # more of the lines break in the same region of the document then unwrapping is required
        docanalysis = DocAnalysis(format, html)
        hardbreaks = docanalysis.line_histogram(.50)
        self.log.debug("Hard line breaks check returned "+unicode_type(hardbreaks))
        # Calculate Length
        unwrap_factor = getattr(self.extra_opts, 'html_unwrap_factor', 0.4)
        length = docanalysis.line_length(unwrap_factor)
        self.log.debug("Median line length is " + unicode_type(length) + ", calculated with " + format + " format")
        # ##### Unwrap lines ######
        if getattr(self.extra_opts, 'unwrap_lines', False):
            # only go through unwrapping code if the histogram shows unwrapping is required or if the user decreased the default unwrap_factor
            if hardbreaks or unwrap_factor < 0.4:
                self.log.debug("Unwrapping required, unwrapping Lines")
                # Dehyphenate with line length limiters
                dehyphenator = Dehyphenator(self.extra_opts.verbose, self.log)
                html = dehyphenator(html,'html', length)
                html = self.punctuation_unwrap(length, html, 'html')
        if getattr(self.extra_opts, 'dehyphenate', False):
            # dehyphenate in cleanup mode to fix anything previous conversions/editing missed
            self.log.debug("Fixing hyphenated content")
            dehyphenator = Dehyphenator(self.extra_opts.verbose, self.log)
            html = dehyphenator(html,'html_cleanup', length)
            html = dehyphenator(html, 'individual_words', length)
        # If still no sections after unwrapping mark split points on lines with no punctuation
        if self.html_preprocess_sections < self.min_chapters and getattr(self.extra_opts, 'markup_chapter_headings', False):
            self.log.debug("Looking for more split points based on punctuation,"
                    " currently have " + unicode_type(self.html_preprocess_sections))
            chapdetect3 = re.compile(
                r'<(?P<styles>(p|div)[^>]*)>\s*(?P<section>(<span[^>]*>)?\s*(?!([\W]+\s*)+)'
                r'(<[ibu][^>]*>){0,2}\s*(<span[^>]*>)?\s*(<[ibu][^>]*>){0,2}\s*(<span[^>]*>)?\s*'
                r'.?(?=[a-z#\-*\s]+<)([a-z#\-*]+\s*){1,5}\s*\s*(</span>)?(</[ibu]>){0,2}\s*'
                r'(</span>)?\s*(</[ibu]>){0,2}\s*(</span>)?\s*</(p|div)>)', re.IGNORECASE)
            html = chapdetect3.sub(self.chapter_break, html)
        if getattr(self.extra_opts, 'renumber_headings', False):
            # search for places where a first or second level heading is immediately followed by another
            # top level heading.  demote the second heading to h3 to prevent splitting between chapter
            # headings and titles, images, etc
            doubleheading = re.compile(
                r'(?P<firsthead><h(1|2)[^>]*>.+?</h(1|2)>\s*(<(?!h\d)[^>]*>\s*)*)<h(1|2)(?P<secondhead>[^>]*>.+?)</h(1|2)>', re.IGNORECASE)
            html = doubleheading.sub('\\g<firsthead>'+'\n<h3'+'\\g<secondhead>'+'</h3>', html)
        # If scene break formatting is enabled, find all blank paragraphs that definitely aren't scene breaks
        # and style them with the 'whitespace' class.  All remaining blank lines are styled as softbreaks.
        # Multiple sequential blank paragraphs are merged with appropriate margins
        # If non-blank scene breaks exist they are center aligned and styled with appropriate margins.
        if getattr(self.extra_opts, 'format_scene_breaks', False):
            self.log.debug('Formatting scene breaks')
            html = re.sub('(?i)<div[^>]*>\\s*<br(\\s?/)?>\\s*</div>', '<p></p>', html)
            html = self.detect_scene_breaks(html)
            html = self.detect_whitespace(html)
            html = self.detect_soft_breaks(html)
            blanks_count = len(self.any_multi_blank.findall(html))
            if blanks_count >= 1:
                html = self.merge_blanks(html, blanks_count)
            detected_scene_break = re.compile(r'<p class="scenebreak"[^>]*>.*?</p>')
            scene_break_count = len(detected_scene_break.findall(html))
            # If the user has enabled scene break replacement, then either softbreaks
            # or 'hard' scene breaks are replaced, depending on which is in use
            # Otherwise separator lines are centered, use a bit larger margin in this case
            replacement_break = getattr(self.extra_opts, 'replace_scene_breaks', None)
            if replacement_break:
                replacement_break = self.markup_user_break(replacement_break)
                if scene_break_count >= 1:
                    html = detected_scene_break.sub(replacement_break, html)
                    html = re.sub('<p\\s+class="softbreak"[^>]*>\\s*</p>', replacement_break, html)
                else:
                    html = re.sub('<p\\s+class="softbreak"[^>]*>\\s*</p>', replacement_break, html)
        if self.deleted_nbsps:
            # put back non-breaking spaces in empty paragraphs so they render correctly
            html = self.anyblank.sub('\n'+r'\g<openline>'+'\u00a0'+r'\g<closeline>', html)
        return html
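
The heading renumbering step above can be tried in isolation. The following is a minimal sketch that reuses the `doubleheading` pattern and replacement string from the function; the wrapper name `demote_double_headings` and the sample HTML are assumptions made for illustration and are not part of the module.

```python
import re

# Same pattern as in the renumber_headings branch above.
doubleheading = re.compile(
    r'(?P<firsthead><h(1|2)[^>]*>.+?</h(1|2)>\s*(<(?!h\d)[^>]*>\s*)*)'
    r'<h(1|2)(?P<secondhead>[^>]*>.+?)</h(1|2)>', re.IGNORECASE)


def demote_double_headings(html):
    # Hypothetical wrapper: when an h1/h2 is followed directly by another
    # h1/h2 (only tags and whitespace in between), rewrite the second one as h3.
    return doubleheading.sub(
        '\\g<firsthead>' + '\n<h3' + '\\g<secondhead>' + '</h3>', html)


sample = '<h1>Chapter 1</h1>\n<h2>The Beginning</h2>\n<p>Text</p>'
demoted = demote_double_headings(sample)

# A paragraph with text between the two headings prevents the match.
untouched = demote_double_headings('<h1>A</h1><p>x</p><h2>B</h2>')
```

Text content between the two headings stops the pattern from matching, so only headings that are truly adjacent are demoted.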
--- a/ebook_converter/ebooks/docx/init.py
+++ b/ebook_converter/ebooks/docx/init.py
@@ -0,0 +1,11 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 class InvalidDOCX(ValueError):
    pass
--- a/ebook_converter/ebooks/docx/block_styles.py
+++ b/ebook_converter/ebooks/docx/block_styles.py
@@ -0,0 +1,478 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import numbers
 from collections import OrderedDict
 from polyglot.builtins import iteritems
 class Inherit(object):
    def __eq__(self, other):
        return other is self
    def __hash__(self):
        return id(self)
    def __lt__(self, other):
        return False
    def __gt__(self, other):
        return other is not self
    def __ge__(self, other):
        return True
    def __le__(self, other):
        return self is other
 inherit = Inherit()
 def binary_property(parent, name, XPath, get):
    vals = XPath('./w:%s' % name)(parent)
    if not vals:
        return inherit
    val = get(vals[0], 'w:val', 'on')
    return val in {'on', '1', 'true'}
 def simple_color(col, auto='black'):
    if not col or col == 'auto' or len(col) != 6:
        return auto
    return '#'+col
 def simple_float(val, mult=1.0):
    try:
        return float(val) * mult
    except (ValueError, TypeError, AttributeError, KeyError):
        pass
 def twips(val, mult=0.05):
    ''' Parse val as either a pure number representing twentieths of a point or a number followed by the suffix pt, representing pts.'''
    try:
        return float(val) * mult
    except (ValueError, TypeError, AttributeError, KeyError):
        if val and val.endswith('pt') and mult == 0.05:
            return twips(val[:-2], mult=1.0)
 LINE_STYLES = {  # {{{
    'basicBlackDashes': 'dashed',
    'basicBlackDots': 'dotted',
    'basicBlackSquares': 'dashed',
    'basicThinLines': 'solid',
    'dashDotStroked': 'groove',
    'dashed': 'dashed',
    'dashSmallGap': 'dashed',
    'dotDash': 'dashed',
    'dotDotDash': 'dashed',
    'dotted': 'dotted',
    'double': 'double',
    'inset': 'inset',
    'nil': 'none',
    'none': 'none',
    'outset': 'outset',
    'single': 'solid',
    'thick': 'solid',
    'thickThinLargeGap': 'double',
    'thickThinMediumGap': 'double',
    'thickThinSmallGap' : 'double',
    'thinThickLargeGap': 'double',
    'thinThickMediumGap': 'double',
    'thinThickSmallGap': 'double',
    'thinThickThinLargeGap': 'double',
    'thinThickThinMediumGap': 'double',
    'thinThickThinSmallGap': 'double',
    'threeDEmboss': 'ridge',
    'threeDEngrave': 'groove',
    'triple': 'double',
 }  # }}}
 # Read from XML {{{
 border_props = ('padding_%s', 'border_%s_width', 'border_%s_style', 'border_%s_color')
 border_edges = ('left', 'top', 'right', 'bottom', 'between')
 def read_single_border(parent, edge, XPath, get):
    color = style = width = padding = None
    for elem in XPath('./w:%s' % edge)(parent):
        c = get(elem, 'w:color')
        if c is not None:
            color = simple_color(c)
        s = get(elem, 'w:val')
        if s is not None:
            style = LINE_STYLES.get(s, 'solid')
        space = get(elem, 'w:space')
        if space is not None:
            try:
                padding = float(space)
            except (ValueError, TypeError):
                pass
        sz = get(elem, 'w:sz')
        if sz is not None:
            # we don't care about art borders (they are only used for page borders)
            try:
                width = min(96, max(2, float(sz))) / 8
            except (ValueError, TypeError):
                pass
    return {p:v for p, v in zip(border_props, (padding, width, style, color))}
 def read_border(parent, dest, XPath, get, border_edges=border_edges, name='pBdr'):
    vals = {k % edge:inherit for edge in border_edges for k in border_props}
    for border in XPath('./w:' + name)(parent):
        for edge in border_edges:
            for prop, val in iteritems(read_single_border(border, edge, XPath, get)):
                if val is not None:
                    vals[prop % edge] = val
    for key, val in iteritems(vals):
        setattr(dest, key, val)
 def border_to_css(edge, style, css):
    bs = getattr(style, 'border_%s_style' % edge)
    bc = getattr(style, 'border_%s_color' % edge)
    bw = getattr(style, 'border_%s_width' % edge)
    if isinstance(bw, numbers.Number):
        # WebKit needs at least 1pt to render borders and 3pt to render double borders
        bw = max(bw, (3 if bs == 'double' else 1))
    if bs is not inherit and bs is not None:
        css['border-%s-style' % edge] = bs
    if bc is not inherit and bc is not None:
        css['border-%s-color' % edge] = bc
    if bw is not inherit and bw is not None:
        if isinstance(bw, numbers.Number):
            bw = '%.3gpt' % bw
        css['border-%s-width' % edge] = bw
 def read_indent(parent, dest, XPath, get):
    padding_left = padding_right = text_indent = inherit
    for indent in XPath('./w:ind')(parent):
        l, lc = get(indent, 'w:left'), get(indent, 'w:leftChars')
        pl = simple_float(lc, 0.01) if lc is not None else simple_float(l, 0.05) if l is not None else None
        if pl is not None:
            padding_left = '%.3g%s' % (pl, 'em' if lc is not None else 'pt')
        r, rc = get(indent, 'w:right'), get(indent, 'w:rightChars')
        pr = simple_float(rc, 0.01) if rc is not None else simple_float(r, 0.05) if r is not None else None
        if pr is not None:
            padding_right = '%.3g%s' % (pr, 'em' if rc is not None else 'pt')
        h, hc = get(indent, 'w:hanging'), get(indent, 'w:hangingChars')
        fl, flc = get(indent, 'w:firstLine'), get(indent, 'w:firstLineChars')
        h = h if h is None else '-'+h
        hc = hc if hc is None else '-'+hc
        ti = (simple_float(hc, 0.01) if hc is not None else simple_float(h, 0.05) if h is not None else
              simple_float(flc, 0.01) if flc is not None else simple_float(fl, 0.05) if fl is not None else None)
        if ti is not None:
            text_indent = '%.3g%s' % (ti, 'em' if hc is not None or (h is None and flc is not None) else 'pt')
    setattr(dest, 'margin_left', padding_left)
    setattr(dest, 'margin_right', padding_right)
    setattr(dest, 'text_indent', text_indent)
 def read_justification(parent, dest, XPath, get):
    ans = inherit
    for jc in XPath('./w:jc[@w:val]')(parent):
        val = get(jc, 'w:val')
        if not val:
            continue
        if val in {'both', 'distribute'} or 'thai' in val or 'kashida' in val:
            ans = 'justify'
        elif val in {'left', 'center', 'right', 'start', 'end'}:
            # 'start' and 'end' are passed through unchanged as text-align values
            ans = val
    setattr(dest, 'text_align', ans)
 def read_spacing(parent, dest, XPath, get):
    padding_top = padding_bottom = line_height = inherit
    for s in XPath('./w:spacing')(parent):
        a, al, aa = get(s, 'w:after'), get(s, 'w:afterLines'), get(s, 'w:afterAutospacing')
        pb = None if aa in {'on', '1', 'true'} else simple_float(al, 0.02) if al is not None else simple_float(a, 0.05) if a is not None else None
        if pb is not None:
            padding_bottom = '%.3g%s' % (pb, 'ex' if al is not None else 'pt')
        b, bl, bb = get(s, 'w:before'), get(s, 'w:beforeLines'), get(s, 'w:beforeAutospacing')
        pt = None if bb in {'on', '1', 'true'} else simple_float(bl, 0.02) if bl is not None else simple_float(b, 0.05) if b is not None else None
        if pt is not None:
            padding_top = '%.3g%s' % (pt, 'ex' if bl is not None else 'pt')
        l, lr = get(s, 'w:line'), get(s, 'w:lineRule', 'auto')
        if l is not None:
            lh = simple_float(l, 0.05) if lr in {'exact', 'atLeast'} else simple_float(l, 1/240.0)
            if lh is not None:
                line_height = '%.3g%s' % (lh, 'pt' if lr in {'exact', 'atLeast'} else '')
    setattr(dest, 'margin_top', padding_top)
    setattr(dest, 'margin_bottom', padding_bottom)
    setattr(dest, 'line_height', line_height)
 def read_shd(parent, dest, XPath, get):
    ans = inherit
    for shd in XPath('./w:shd[@w:fill]')(parent):
        val = get(shd, 'w:fill')
        if val:
            ans = simple_color(val, auto='transparent')
    setattr(dest, 'background_color', ans)
 def read_numbering(parent, dest, XPath, get):
    lvl = num_id = inherit
    for np in XPath('./w:numPr')(parent):
        for ilvl in XPath('./w:ilvl[@w:val]')(np):
            try:
                lvl = int(get(ilvl, 'w:val'))
            except (ValueError, TypeError):
                pass
        for num in XPath('./w:numId[@w:val]')(np):
            num_id = get(num, 'w:val')
    setattr(dest, 'numbering_id', num_id)
    setattr(dest, 'numbering_level', lvl)
 class Frame(object):
    all_attributes = ('drop_cap', 'h', 'w', 'h_anchor', 'h_rule', 'v_anchor', 'wrap',
                      'h_space', 'v_space', 'lines', 'x_align', 'y_align', 'x', 'y')
    def __init__(self, fp, XPath, get):
        self.drop_cap = get(fp, 'w:dropCap', 'none')
        try:
            self.h = int(get(fp, 'w:h'))/20
        except (ValueError, TypeError):
            self.h = 0
        try:
            self.w = int(get(fp, 'w:w'))/20
        except (ValueError, TypeError):
            self.w = None
        try:
            self.x = int(get(fp, 'w:x'))/20
        except (ValueError, TypeError):
            self.x = 0
        try:
            self.y = int(get(fp, 'w:y'))/20
        except (ValueError, TypeError):
            self.y = 0
        self.h_anchor = get(fp, 'w:hAnchor', 'page')
        self.h_rule = get(fp, 'w:hRule', 'auto')
        self.v_anchor = get(fp, 'w:vAnchor', 'page')
        self.wrap = get(fp, 'w:wrap', 'around')
        self.x_align = get(fp, 'w:xAlign')
        self.y_align = get(fp, 'w:yAlign')
        try:
            self.h_space = int(get(fp, 'w:hSpace'))/20
        except (ValueError, TypeError):
            self.h_space = 0
        try:
            self.v_space = int(get(fp, 'w:vSpace'))/20
        except (ValueError, TypeError):
            self.v_space = 0
        try:
            self.lines = int(get(fp, 'w:lines'))
        except (ValueError, TypeError):
            self.lines = 1
    def css(self, page):
        is_dropcap = self.drop_cap in {'drop', 'margin'}
        ans = {'overflow': 'hidden'}
        if is_dropcap:
            ans['float'] = 'left'
            ans['margin'] = '0'
            ans['padding-right'] = '0.2em'
        else:
            if self.h_rule != 'auto':
                t = 'min-height' if self.h_rule == 'atLeast' else 'height'
                ans[t] = '%.3gpt' % self.h
            if self.w is not None:
                ans['width'] = '%.3gpt' % self.w
            ans['padding-top'] = ans['padding-bottom'] = '%.3gpt' % self.v_space
            if self.wrap not in {None, 'none'}:
                ans['padding-left'] = ans['padding-right'] = '%.3gpt' % self.h_space
                if self.x_align is None:
                    fl = 'left' if self.x/page.width < 0.5 else 'right'
                else:
                    fl = 'right' if self.x_align == 'right' else 'left'
                ans['float'] = fl
        return ans
    def __eq__(self, other):
        for x in self.all_attributes:
            if getattr(other, x, inherit) != getattr(self, x):
                return False
        return True
    def __ne__(self, other):
        return not self.__eq__(other)
 def read_frame(parent, dest, XPath, get):
    ans = inherit
    for fp in XPath('./w:framePr')(parent):
        ans = Frame(fp, XPath, get)
    setattr(dest, 'frame', ans)
 # }}}
 class ParagraphStyle(object):
    all_properties = (
        'adjustRightInd', 'autoSpaceDE', 'autoSpaceDN', 'bidi',
        'contextualSpacing', 'keepLines', 'keepNext', 'mirrorIndents',
        'pageBreakBefore', 'snapToGrid', 'suppressLineNumbers',
        'suppressOverlap', 'topLinePunct', 'widowControl', 'wordWrap',
        # Border margins padding
        'border_left_width', 'border_left_style', 'border_left_color', 'padding_left',
        'border_top_width', 'border_top_style', 'border_top_color', 'padding_top',
        'border_right_width', 'border_right_style', 'border_right_color', 'padding_right',
        'border_bottom_width', 'border_bottom_style', 'border_bottom_color', 'padding_bottom',
        'border_between_width', 'border_between_style', 'border_between_color', 'padding_between',
        'margin_left', 'margin_top', 'margin_right', 'margin_bottom',
        # Misc.
        'text_indent', 'text_align', 'line_height', 'background_color',
        'numbering_id', 'numbering_level', 'font_family', 'font_size', 'color', 'frame',
        'cs_font_size', 'cs_font_family',
    )
    def __init__(self, namespace, pPr=None):
        self.namespace = namespace
        self.linked_style = None
        if pPr is None:
            for p in self.all_properties:
                setattr(self, p, inherit)
        else:
            for p in (
                'adjustRightInd', 'autoSpaceDE', 'autoSpaceDN', 'bidi',
                'contextualSpacing', 'keepLines', 'keepNext', 'mirrorIndents',
                'pageBreakBefore', 'snapToGrid', 'suppressLineNumbers',
                'suppressOverlap', 'topLinePunct', 'widowControl', 'wordWrap',
            ):
                setattr(self, p, binary_property(pPr, p, namespace.XPath, namespace.get))
            for x in ('border', 'indent', 'justification', 'spacing', 'shd', 'numbering', 'frame'):
                f = read_funcs[x]
                f(pPr, self, namespace.XPath, namespace.get)
            for s in namespace.XPath('./w:pStyle[@w:val]')(pPr):
                self.linked_style = namespace.get(s, 'w:val')
            self.font_family = self.font_size = self.color = self.cs_font_size = self.cs_font_family = inherit
        self._css = None
        self._border_key = None
    def update(self, other):
        for prop in self.all_properties:
            nval = getattr(other, prop)
            if nval is not inherit:
                setattr(self, prop, nval)
        if other.linked_style is not None:
            self.linked_style = other.linked_style
    def resolve_based_on(self, parent):
        for p in self.all_properties:
            val = getattr(self, p)
            if val is inherit:
                setattr(self, p, getattr(parent, p))
    @property
    def css(self):
        if self._css is None:
            self._css = c = OrderedDict()
            if self.keepLines is True:
                c['page-break-inside'] = 'avoid'
            if self.pageBreakBefore is True:
                c['page-break-before'] = 'always'
            if self.keepNext is True:
                c['page-break-after'] = 'avoid'
            for edge in ('left', 'top', 'right', 'bottom'):
                border_to_css(edge, self, c)
                val = getattr(self, 'padding_%s' % edge)
                if val is not inherit:
                    c['padding-%s' % edge] = '%.3gpt' % val
                val = getattr(self, 'margin_%s' % edge)
                if val is not inherit:
                    c['margin-%s' % edge] = val
            if self.line_height not in {inherit, '1'}:
                c['line-height'] = self.line_height
            for x in ('text_indent', 'background_color', 'font_family', 'font_size', 'color'):
                val = getattr(self, x)
                if val is not inherit:
                    if x == 'font_size':
                        val = '%.3gpt' % val
                    c[x.replace('_', '-')] = val
            ta = self.text_align
            if ta is not inherit:
                if self.bidi is True:
                    ta = {'left':'right', 'right':'left'}.get(ta, ta)
                c['text-align'] = ta
        return self._css
    @property
    def border_key(self):
        if self._border_key is None:
            k = []
            for edge in border_edges:
                for prop in border_props:
                    prop = prop % edge
                    k.append(getattr(self, prop))
            self._border_key = tuple(k)
        return self._border_key
    def has_identical_borders(self, other_style):
        return self.border_key == getattr(other_style, 'border_key', None)
    def clear_borders(self):
        for edge in border_edges[:-1]:
            for prop in ('width', 'color', 'style'):
                setattr(self, 'border_%s_%s' % (edge, prop), inherit)
    def clone_border_styles(self):
        style = ParagraphStyle(self.namespace)
        for edge in border_edges[:-1]:
            for prop in ('width', 'color', 'style'):
                attr = 'border_%s_%s' % (edge, prop)
                setattr(style, attr, getattr(self, attr))
        return style
    def apply_between_border(self):
        for prop in ('width', 'color', 'style'):
            setattr(self, 'border_bottom_%s' % prop, getattr(self, 'border_between_%s' % prop))
    def has_visible_border(self):
        for edge in border_edges[:-1]:
            bw, bs = getattr(self, 'border_%s_width' % edge), getattr(self, 'border_%s_style' % edge)
            if bw is not inherit and bw and bs is not inherit and bs != 'none':
                return True
        return False
 read_funcs = {k[5:]:v for k, v in iteritems(globals()) if k.startswith('read_')}
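
The unit handling in this file can be checked with a small standalone sketch. `twips` below is copied from the helper above; `border_width_css` is a hypothetical function, not present in the module, that combines the clamp from `read_single_border` (`w:sz` is treated as eighths of a point, limited to the range 2 to 96) with the rendering floor applied in `border_to_css`.

```python
def twips(val, mult=0.05):
    # Copy of the helper above: a bare number is twentieths of a point,
    # a value ending in 'pt' is already in points.
    try:
        return float(val) * mult
    except (ValueError, TypeError, AttributeError, KeyError):
        if val and val.endswith('pt') and mult == 0.05:
            return twips(val[:-2], mult=1.0)


def border_width_css(sz, style='solid'):
    # Hypothetical helper: clamp w:sz as read_single_border does, then
    # raise the width to at least 1pt (3pt for double borders) as
    # border_to_css does, and format it as a CSS length.
    width = min(96, max(2, float(sz))) / 8
    width = max(width, 3 if style == 'double' else 1)
    return '%.3gpt' % width


from_twips = twips('240')      # 240 twentieths of a point
from_points = twips('12pt')    # explicit point suffix
unparsable = twips('abc')      # falls through and returns None

thin = border_width_css('4')
thin_double = border_width_css('4', 'double')
clamped = border_width_css('200')
```

Unparsable input yields `None` rather than raising, which is why callers in this file test results against `None` before formatting them.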
--- a/ebook_converter/ebooks/docx/char_styles.py
+++ b/ebook_converter/ebooks/docx/char_styles.py
@@ -0,0 +1,302 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 from collections import OrderedDict
 from calibre.ebooks.docx.block_styles import (  # noqa
    inherit, simple_color, LINE_STYLES, simple_float, binary_property, read_shd)
 # Read from XML {{{
 def read_text_border(parent, dest, XPath, get):
    border_color = border_style = border_width = padding = inherit
    elems = XPath('./w:bdr')(parent)
    if elems and elems[0].attrib:
        border_color = simple_color('auto')
        border_style = 'none'
        border_width = 1
    for elem in elems:
        color = get(elem, 'w:color')
        if color is not None:
            border_color = simple_color(color)
        style = get(elem, 'w:val')
        if style is not None:
            border_style = LINE_STYLES.get(style, 'solid')
        space = get(elem, 'w:space')
        if space is not None:
            try:
                padding = float(space)
            except (ValueError, TypeError):
                pass
        sz = get(elem, 'w:sz')
        if sz is not None:
            # we don't care about art borders (they are only used for page borders)
            try:
                # A border of less than 1pt is not rendered by WebKit
                border_width = min(96, max(8, float(sz))) / 8
            except (ValueError, TypeError):
                pass
    setattr(dest, 'border_color', border_color)
    setattr(dest, 'border_style', border_style)
    setattr(dest, 'border_width', border_width)
    setattr(dest, 'padding', padding)
 def read_color(parent, dest, XPath, get):
    ans = inherit
    for col in XPath('./w:color[@w:val]')(parent):
        val = get(col, 'w:val')
        if not val:
            continue
        ans = simple_color(val)
    setattr(dest, 'color', ans)
 def convert_highlight_color(val):
    return {
        'darkBlue': '#000080', 'darkCyan': '#008080', 'darkGray': '#808080',
        'darkGreen': '#008000', 'darkMagenta': '#800080', 'darkRed': '#800000', 'darkYellow': '#808000',
        'lightGray': '#c0c0c0'}.get(val, val)
 def read_highlight(parent, dest, XPath, get):
    ans = inherit
    for col in XPath('./w:highlight[@w:val]')(parent):
        val = get(col, 'w:val')
        if not val:
            continue
        if val == 'none':
            val = 'transparent'
        else:
            val = convert_highlight_color(val)
        ans = val
    setattr(dest, 'highlight', ans)
 def read_lang(parent, dest, XPath, get):
    ans = inherit
    for col in XPath('./w:lang[@w:val]')(parent):
        val = get(col, 'w:val')
        if not val:
            continue
        try:
            code = int(val, 16)
        except (ValueError, TypeError):
            ans = val
        else:
            from calibre.ebooks.docx.lcid import lcid
            val = lcid.get(code, None)
            if val:
                ans = val
    setattr(dest, 'lang', ans)
 def read_letter_spacing(parent, dest, XPath, get):
    ans = inherit
    for col in XPath('./w:spacing[@w:val]')(parent):
        val = simple_float(get(col, 'w:val'), 0.05)
        if val is not None:
            ans = val
    setattr(dest, 'letter_spacing', ans)
 def read_underline(parent, dest, XPath, get):
    ans = inherit
    for col in XPath('./w:u[@w:val]')(parent):
        val = get(col, 'w:val')
        if val:
            ans = val if val == 'none' else 'underline'
    setattr(dest, 'text_decoration', ans)
 def read_vert_align(parent, dest, XPath, get):
    ans = inherit
    for col in XPath('./w:vertAlign[@w:val]')(parent):
        val = get(col, 'w:val')
        if val and val in {'baseline', 'subscript', 'superscript'}:
            ans = val
    setattr(dest, 'vert_align', ans)
 def read_position(parent, dest, XPath, get):
    ans = inherit
    for col in XPath('./w:position[@w:val]')(parent):
        val = get(col, 'w:val')
        try:
            ans = float(val)/2.0
        except Exception:
            pass
    setattr(dest, 'position', ans)
 def read_font(parent, dest, XPath, get):
    ff = inherit
    for col in XPath('./w:rFonts')(parent):
        val = get(col, 'w:asciiTheme')
        if val:
            val = '|%s|' % val
        else:
            val = get(col, 'w:ascii')
        if val:
            ff = val
    setattr(dest, 'font_family', ff)
    for col in XPath('./w:sz[@w:val]')(parent):
        val = simple_float(get(col, 'w:val'), 0.5)
        if val is not None:
            setattr(dest, 'font_size', val)
            return
    setattr(dest, 'font_size', inherit)
 def read_font_cs(parent, dest, XPath, get):
    ff = inherit
    for col in XPath('./w:rFonts')(parent):
        val = get(col, 'w:csTheme')
        if val:
            val = '|%s|' % val
        else:
            val = get(col, 'w:cs')
        if val:
            ff = val
    setattr(dest, 'cs_font_family', ff)
    for col in XPath('./w:szCS[@w:val]')(parent):
        val = simple_float(get(col, 'w:val'), 0.5)
        if val is not None:
            setattr(dest, 'cs_font_size', val)
            return
    setattr(dest, 'cs_font_size', inherit)
 # }}}
 class RunStyle(object):
    all_properties = {
        'b', 'bCs', 'caps', 'cs', 'dstrike', 'emboss', 'i', 'iCs', 'imprint',
        'rtl', 'shadow', 'smallCaps', 'strike', 'vanish', 'webHidden',
        'border_color', 'border_style', 'border_width', 'padding', 'color', 'highlight', 'background_color',
        'letter_spacing', 'font_size', 'text_decoration', 'vert_align', 'lang', 'font_family', 'position',
        'cs_font_size', 'cs_font_family'
    }
    toggle_properties = {
        'b', 'bCs', 'caps', 'emboss', 'i', 'iCs', 'imprint', 'shadow', 'smallCaps', 'strike', 'vanish',
    }
    def __init__(self, namespace, rPr=None):
        self.namespace = namespace
        self.linked_style = None
        if rPr is None:
            for p in self.all_properties:
                setattr(self, p, inherit)
        else:
            X, g = namespace.XPath, namespace.get
            for p in (
                'b', 'bCs', 'caps', 'cs', 'dstrike', 'emboss', 'i', 'iCs', 'imprint', 'rtl', 'shadow',
                'smallCaps', 'strike', 'vanish', 'webHidden',
            ):
                setattr(self, p, binary_property(rPr, p, X, g))
            read_font(rPr, self, X, g)
            read_font_cs(rPr, self, X, g)
            read_text_border(rPr, self, X, g)
            read_color(rPr, self, X, g)
            read_highlight(rPr, self, X, g)
            read_shd(rPr, self, X, g)
            read_letter_spacing(rPr, self, X, g)
            read_underline(rPr, self, X, g)
            read_vert_align(rPr, self, X, g)
            read_position(rPr, self, X, g)
            read_lang(rPr, self, X, g)
            for s in X('./w:rStyle[@w:val]')(rPr):
                self.linked_style = g(s, 'w:val')
        self._css = None
    def update(self, other):
        for prop in self.all_properties:
            nval = getattr(other, prop)
            if nval is not inherit:
                setattr(self, prop, nval)
        if other.linked_style is not None:
            self.linked_style = other.linked_style
    def resolve_based_on(self, parent):
        for p in self.all_properties:
            val = getattr(self, p)
            if val is inherit:
                setattr(self, p, getattr(parent, p))
    def get_border_css(self, ans):
        for x in ('color', 'style', 'width'):
            val = getattr(self, 'border_'+x)
            if x == 'width' and val is not inherit:
                val = '%.3gpt' % val
            if val is not inherit:
                ans['border-%s' % x] = val
    def clear_border_css(self):
        for x in ('color', 'style', 'width'):
            setattr(self, 'border_'+x, inherit)
    @property
    def css(self):
        if self._css is None:
            c = self._css = OrderedDict()
            td = set()
            if self.text_decoration is not inherit:
                td.add(self.text_decoration)
            if self.strike and self.strike is not inherit:
                td.add('line-through')
            if self.dstrike and self.dstrike is not inherit:
                td.add('line-through')
            if td:
                c['text-decoration'] = ' '.join(td)
            if self.caps is True:
                c['text-transform'] = 'uppercase'
            if self.i is True:
                c['font-style'] = 'italic'
            if self.shadow and self.shadow is not inherit:
                c['text-shadow'] = '2px 2px'
            if self.smallCaps is True:
                c['font-variant'] = 'small-caps'
            if self.vanish is True or self.webHidden is True:
                c['display'] = 'none'
            self.get_border_css(c)
            if self.padding is not inherit:
                c['padding'] = '%.3gpt' % self.padding
            for x in ('color', 'background_color'):
                val = getattr(self, x)
                if val is not inherit:
                    c[x.replace('_', '-')] = val
            for x in ('letter_spacing', 'font_size'):
                val = getattr(self, x)
                if val is not inherit:
                    c[x.replace('_', '-')] = '%.3gpt' % val
            if self.position is not inherit:
                c['vertical-align'] = '%.3gpt' % self.position
            if self.highlight is not inherit and self.highlight != 'transparent':
                c['background-color'] = self.highlight
            if self.b:
                c['font-weight'] = 'bold'
            if self.font_family is not inherit:
                c['font-family'] = self.font_family
        return self._css
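    def same_border(self, other):
        # get_border_css() fills the dict passed in and returns None, so compare the filled dicts
        mine, theirs = {}, {}
        self.get_border_css(mine)
        other.get_border_css(theirs)
        return mine == theirs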
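
Both `ParagraphStyle` and `RunStyle` rely on the `inherit` sentinel to tell "not specified" apart from an explicit value. The sketch below illustrates that cascade under simplifying assumptions: `TinyStyle` is a hypothetical stand-in with three properties, and its `update` and `resolve_based_on` methods mirror the ones defined above rather than reproduce the real classes.

```python
class Inherit(object):
    # Sentinel meaning "no explicit value; take it from the parent style".
    def __repr__(self):
        return 'inherit'


inherit = Inherit()


class TinyStyle(object):
    # Hypothetical, reduced stand-in for RunStyle / ParagraphStyle.
    all_properties = ('b', 'i', 'font_size')

    def __init__(self, **explicit):
        for p in self.all_properties:
            setattr(self, p, explicit.get(p, inherit))

    def update(self, other):
        # Values set explicitly on `other` override the current ones.
        for p in self.all_properties:
            nval = getattr(other, p)
            if nval is not inherit:
                setattr(self, p, nval)

    def resolve_based_on(self, parent):
        # Properties still marked inherit are filled in from the parent.
        for p in self.all_properties:
            if getattr(self, p) is inherit:
                setattr(self, p, getattr(parent, p))


base = TinyStyle(b=False, i=False, font_size=11.0)
emphasis = TinyStyle(i=True)
emphasis.resolve_based_on(base)            # picks up b and font_size from base
emphasis.update(TinyStyle(font_size=14.0))  # direct formatting wins
```

Identity checks (`is inherit`) are used throughout, so a legitimate falsy value such as `False` or `0` is never mistaken for "unset".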
--- a/ebook_converter/ebooks/docx/cleanup.py
+++ b/ebook_converter/ebooks/docx/cleanup.py
@@ -0,0 +1,235 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import os
 from polyglot.builtins import itervalues, range
 NBSP = '\xa0'
 def mergeable(previous, current):
    if previous.tail or current.tail:
        return False
    if previous.get('class', None) != current.get('class', None):
        return False
    if current.get('id', False):
        return False
    for attr in ('style', 'lang', 'dir'):
        if previous.get(attr) != current.get(attr):
            return False
    try:
        return next(previous.itersiblings()) is current
    except StopIteration:
        return False
 def append_text(parent, text):
    if len(parent) > 0:
        parent[-1].tail = (parent[-1].tail or '') + text
    else:
        parent.text = (parent.text or '') + text
 def merge(parent, span):
    if span.text:
        append_text(parent, span.text)
    for child in span:
        parent.append(child)
    if span.tail:
        append_text(parent, span.tail)
    span.getparent().remove(span)
 def merge_run(run):
    parent = run[0]
    for span in run[1:]:
        merge(parent, span)
 def liftable(css):
    # A <span> is liftable if all its styling would work just as well if it is
    # specified on the parent element.
    prefixes = {x.partition('-')[0] for x in css}
    return not (prefixes - {'text', 'font', 'letter', 'color', 'background'})
 def add_text(elem, attr, text):
    old = getattr(elem, attr) or ''
    setattr(elem, attr, old + text)
 def lift(span):
    # Replace an element by its content (text, children and tail)
    parent = span.getparent()
    idx = parent.index(span)
    try:
        last_child = span[-1]
    except IndexError:
        last_child = None
    if span.text:
        if idx == 0:
            add_text(parent, 'text', span.text)
        else:
            add_text(parent[idx - 1], 'tail', span.text)
    for child in reversed(span):
        parent.insert(idx, child)
    parent.remove(span)
    if span.tail:
        if last_child is None:
            if idx == 0:
                add_text(parent, 'text', span.tail)
            else:
                add_text(parent[idx - 1], 'tail', span.tail)
        else:
            add_text(last_child, 'tail', span.tail)
 def before_count(root, tag, limit=10):
    body = root.xpath('//body[1]')
    if not body:
        return limit
    ans = 0
    for elem in body[0].iterdescendants():
        if elem is tag:
            return ans
        ans += 1
        if ans > limit:
            return limit
    # tag is not a descendant of <body>: report the limit instead of None
    return limit
 def wrap_contents(tag_name, elem):
    wrapper = elem.makeelement(tag_name)
    wrapper.text, elem.text = elem.text, ''
    for child in tuple(elem):  # snapshot, since children are moved during the loop
        elem.remove(child)
        wrapper.append(child)
    elem.append(wrapper)
 def cleanup_markup(log, root, styles, dest_dir, detect_cover, XPath):
    # Apply vertical-align
    for span in root.xpath('//span[@data-docx-vert]'):
        wrap_contents(span.attrib.pop('data-docx-vert'), span)
    # Move <hr>s outside paragraphs, if possible.
    pancestor = XPath('|'.join('ancestor::%s[1]' % x for x in ('p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6')))
    for hr in root.xpath('//span/hr'):
        p = pancestor(hr)
        if p:
            p = p[0]
            descendants = tuple(p.iterdescendants())
            if descendants[-1] is hr:
                parent = p.getparent()
                idx = parent.index(p)
                parent.insert(idx+1, hr)
                hr.tail = '\n\t'
    # Merge consecutive spans that have the same styling
    current_run = []
    for span in root.xpath('//span'):
        if not current_run:
            current_run.append(span)
        else:
            last = current_run[-1]
            if mergeable(last, span):
                current_run.append(span)
            else:
                if len(current_run) > 1:
                    merge_run(current_run)
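                current_run = [span]
    # Flush the final run, which the loop above never merges
    if len(current_run) > 1:
        merge_run(current_run)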
    # Process dir attributes
    class_map = dict(itervalues(styles.classes))
    parents = ('p', 'div') + tuple('h%d' % i for i in range(1, 7))
    for parent in root.xpath('//*[(%s)]' % ' or '.join('name()="%s"' % t for t in parents)):
        # Ensure that children of rtl parents that are not rtl have an
        # explicit dir set. Also, remove dir from children if it is the same as
        # that of the parent.
        if len(parent):
            parent_dir = parent.get('dir')
            for child in parent.iterchildren('span'):
                child_dir = child.get('dir')
                if parent_dir == 'rtl' and child_dir != 'rtl':
                    child_dir = 'ltr'
                    child.set('dir', child_dir)
                if child_dir and child_dir == parent_dir:
                    child.attrib.pop('dir')
    # Remove unnecessary span tags that are the only child of a parent block
    # element
    for parent in root.xpath('//*[(%s) and count(span)=1]' % ' or '.join('name()="%s"' % t for t in parents)):
        if len(parent) == 1 and not parent.text and not parent[0].tail and not parent[0].get('id', None):
            # We have a block whose contents are entirely enclosed in a <span>
            span = parent[0]
            span_class = span.get('class', None)
            span_css = class_map.get(span_class, {})
            span_dir = span.get('dir')
            if liftable(span_css) and (not span_dir or span_dir == parent.get('dir')):
                pclass = parent.get('class', None)
                if span_class:
                    pclass = (pclass + ' ' + span_class) if pclass else span_class
                    parent.set('class', pclass)
                parent.text = span.text
                parent.remove(span)
                if span.get('lang'):
                    parent.set('lang', span.get('lang'))
                if span.get('dir'):
                    parent.set('dir', span.get('dir'))
                for child in span:
                    parent.append(child)
    # Make spans whose only styling is bold or italic into <b> and <i> tags
    for span in root.xpath('//span[@class and not(@style)]'):
        css = class_map.get(span.get('class', None), {})
        if len(css) == 1:
            if css == {'font-style':'italic'}:
                span.tag = 'i'
                del span.attrib['class']
            elif css == {'font-weight':'bold'}:
                span.tag = 'b'
                del span.attrib['class']
    # Get rid of <span>s that have no styling
    for span in root.xpath('//span[not(@class or @id or @style or @lang or @dir)]'):
        lift(span)
    # Convert <p><br style="page-break-after:always"> </p> style page breaks
    # into something the viewer will render as a page break
    for p in root.xpath('//p[br[@style="page-break-after:always"]]'):
        if len(p) == 1 and (not p[0].tail or not p[0].tail.strip()):
            p.remove(p[0])
            prefix = p.get('style', '')
            if prefix:
                prefix += '; '
            p.set('style', prefix + 'page-break-after:always')
            p.text = NBSP if not p.text else p.text
    if detect_cover:
        # Check if the first image in the document is possibly a cover
        img = root.xpath('//img[@src][1]')
        if img:
            img = img[0]
            path = os.path.join(dest_dir, img.get('src'))
            if os.path.exists(path) and before_count(root, img, limit=10) < 5:
                from calibre.utils.imghdr import identify
                try:
                    with lopen(path, 'rb') as imf:
                        fmt, width, height = identify(imf)
                except Exception:
                    width, height, fmt = 0, 0, None  # noqa
                del fmt
                try:
                    is_cover = 0.8 <= height/width <= 1.8 and height*width >= 160000
                except ZeroDivisionError:
                    is_cover = False
                if is_cover:
                    log.debug('Detected an image that looks like a cover')
                    img.getparent().remove(img)
                    return path
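The cover check at the end of `cleanup_markup` can be tried in isolation. The following is a minimal sketch, assuming the same thresholds as the code above; `looks_like_cover` is a hypothetical helper name that does not exist in the module.

```python
def looks_like_cover(width, height):
    # Same test as in cleanup_markup: a portrait-ish aspect ratio
    # (height/width between 0.8 and 1.8) and at least 160000 pixels.
    try:
        return 0.8 <= height / width <= 1.8 and height * width >= 160000
    except ZeroDivisionError:
        # identify() failed and the size fell back to 0x0
        return False


print(looks_like_cover(600, 800))  # True: ratio 1.33, 480000 pixels
print(looks_like_cover(800, 200))  # False: ratio 0.25 is too wide
print(looks_like_cover(100, 150))  # False: only 15000 pixels
print(looks_like_cover(0, 0))      # False: unreadable image
```

In the real function the image must also appear among the first few descendants of `<body>` (`before_count(...) < 5`) before this test is applied.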
--- a/ebook_converter/ebooks/docx/container.py
+++ b/ebook_converter/ebooks/docx/container.py
@@ -0,0 +1,268 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import os, sys, shutil
 from lxml import etree
 from calibre import walk, guess_type
 from calibre.ebooks.metadata import string_to_authors, authors_to_sort_string
 from calibre.ebooks.metadata.book.base import Metadata
 from calibre.ebooks.docx import InvalidDOCX
 from calibre.ebooks.docx.names import DOCXNamespace
 from calibre.ptempfile import PersistentTemporaryDirectory
 from calibre.utils.localization import canonicalize_lang
 from calibre.utils.logging import default_log
 from calibre.utils.zipfile import ZipFile
 from calibre.utils.xml_parse import safe_xml_fromstring
 def fromstring(raw, parser=None):
    return safe_xml_fromstring(raw)
 # Read metadata {{{
 def read_doc_props(raw, mi, XPath):
    root = fromstring(raw)
    titles = XPath('//dc:title')(root)
    if titles:
        title = titles[0].text
        if title and title.strip():
            mi.title = title.strip()
    tags = []
    for subject in XPath('//dc:subject')(root):
        if subject.text and subject.text.strip():
            tags.append(subject.text.strip().replace(',', '_'))
    for keywords in XPath('//cp:keywords')(root):
        if keywords.text and keywords.text.strip():
            for x in keywords.text.split():
                tags.extend(y.strip() for y in x.split(',') if y.strip())
    if tags:
        mi.tags = tags
    authors = XPath('//dc:creator')(root)
    aut = []
    for author in authors:
        if author.text and author.text.strip():
            aut.extend(string_to_authors(author.text))
    if aut:
        mi.authors = aut
        mi.author_sort = authors_to_sort_string(aut)
    desc = XPath('//dc:description')(root)
    if desc:
        raw = etree.tostring(desc[0], method='text', encoding='unicode')
        raw = raw.replace('_x000d_', '')  # Word 2007 mangles newlines in the summary
        mi.comments = raw.strip()
    langs = []
    for lang in XPath('//dc:language')(root):
        if lang.text and lang.text.strip():
            l = canonicalize_lang(lang.text)
            if l:
                langs.append(l)
    if langs:
        mi.languages = langs
 def read_app_props(raw, mi):
    root = fromstring(raw)
    company = root.xpath('//*[local-name()="Company"]')
    if company and company[0].text and company[0].text.strip():
        mi.publisher = company[0].text.strip()
 def read_default_style_language(raw, mi, XPath):
    root = fromstring(raw)
    for lang in XPath('/w:styles/w:docDefaults/w:rPrDefault/w:rPr/w:lang/@w:val')(root):
        lang = canonicalize_lang(lang)
        if lang:
            mi.languages = [lang]
            break
 # }}}
 class DOCX(object):
    def __init__(self, path_or_stream, log=None, extract=True):
        self.docx_is_transitional = True
        stream = path_or_stream if hasattr(path_or_stream, 'read') else open(path_or_stream, 'rb')
        self.name = getattr(stream, 'name', None) or '<stream>'
        self.log = log or default_log
        if extract:
            self.extract(stream)
        else:
            self.init_zipfile(stream)
        self.read_content_types()
        self.read_package_relationships()
        self.namespace = DOCXNamespace(self.docx_is_transitional)
    def init_zipfile(self, stream):
        self.zipf = ZipFile(stream)
        self.names = frozenset(self.zipf.namelist())
    def extract(self, stream):
        self.tdir = PersistentTemporaryDirectory('docx_container')
        try:
            zf = ZipFile(stream)
            zf.extractall(self.tdir)
        except Exception:
            self.log.exception('DOCX appears to be an invalid ZIP file, trying a'
                    ' more forgiving ZIP parser')
            from calibre.utils.localunzip import extractall
            stream.seek(0)
            extractall(stream, self.tdir)
        self.names = {}
        for f in walk(self.tdir):
            name = os.path.relpath(f, self.tdir).replace(os.sep, '/')
            self.names[name] = f
    def exists(self, name):
        return name in self.names
    def read(self, name):
        if hasattr(self, 'zipf'):
            return self.zipf.open(name).read()
        path = self.names[name]
        with open(path, 'rb') as f:
            return f.read()
    def read_content_types(self):
        try:
            raw = self.read('[Content_Types].xml')
        except KeyError:
            raise InvalidDOCX('The DOCX file %s has no [Content_Types].xml' % self.name)
        root = fromstring(raw)
        self.content_types = {}
        self.default_content_types = {}
        for item in root.xpath('//*[local-name()="Types"]/*[local-name()="Default" and @Extension and @ContentType]'):
            self.default_content_types[item.get('Extension').lower()] = item.get('ContentType')
        for item in root.xpath('//*[local-name()="Types"]/*[local-name()="Override" and @PartName and @ContentType]'):
            name = item.get('PartName').lstrip('/')
            self.content_types[name] = item.get('ContentType')
    def content_type(self, name):
        if name in self.content_types:
            return self.content_types[name]
        ext = name.rpartition('.')[-1].lower()
        if ext in self.default_content_types:
            return self.default_content_types[ext]
        return guess_type(name)[0]
    def read_package_relationships(self):
        try:
            raw = self.read('_rels/.rels')
        except KeyError:
            raise InvalidDOCX('The DOCX file %s has no _rels/.rels' % self.name)
        root = fromstring(raw)
        self.relationships = {}
        self.relationships_rmap = {}
        for item in root.xpath('//*[local-name()="Relationships"]/*[local-name()="Relationship" and @Type and @Target]'):
            target = item.get('Target').lstrip('/')
            typ = item.get('Type')
            if target == 'word/document.xml':
                self.docx_is_transitional = typ != 'http://purl.oclc.org/ooxml/officeDocument/relationships/officeDocument'
            self.relationships[typ] = target
            self.relationships_rmap[target] = typ
    @property
    def document_name(self):
        name = self.relationships.get(self.namespace.names['DOCUMENT'], None)
        if name is None:
            names = tuple(n for n in self.names if n == 'document.xml' or n.endswith('/document.xml'))
            if not names:
                raise InvalidDOCX('The DOCX file %s has no main document' % self.name)
            name = names[0]
        return name
    @property
    def document(self):
        return fromstring(self.read(self.document_name))
    @property
    def document_relationships(self):
        return self.get_relationships(self.document_name)
    def get_relationships(self, name):
        base = '/'.join(name.split('/')[:-1])
        by_id, by_type = {}, {}
        parts = name.split('/')
        name = '/'.join(parts[:-1] + ['_rels', parts[-1] + '.rels'])
        try:
            raw = self.read(name)
        except KeyError:
            pass
        else:
            root = fromstring(raw)
            for item in root.xpath('//*[local-name()="Relationships"]/*[local-name()="Relationship" and @Type and @Target]'):
                target = item.get('Target')
                if item.get('TargetMode', None) != 'External' and not target.startswith('#'):
                    target = '/'.join((base, target.lstrip('/')))
                typ = item.get('Type')
                Id = item.get('Id')
                by_id[Id] = by_type[typ] = target
        return by_id, by_type
    def get_document_properties_names(self):
        name = self.relationships.get(self.namespace.names['DOCPROPS'], None)
        if name is None:
            names = tuple(n for n in self.names if n.lower() == 'docprops/core.xml')
            if names:
                name = names[0]
        yield name
        name = self.relationships.get(self.namespace.names['APPPROPS'], None)
        if name is None:
            names = tuple(n for n in self.names if n.lower() == 'docprops/app.xml')
            if names:
                name = names[0]
        yield name
    @property
    def metadata(self):
        mi = Metadata(_('Unknown'))
        dp_name, ap_name = self.get_document_properties_names()
        if dp_name:
            try:
                raw = self.read(dp_name)
            except KeyError:
                pass
            else:
                read_doc_props(raw, mi, self.namespace.XPath)
        if mi.is_null('language'):
            try:
                raw = self.read('word/styles.xml')
            except KeyError:
                pass
            else:
                read_default_style_language(raw, mi, self.namespace.XPath)
        # Reuse ap_name from get_document_properties_names() so that its
        # docProps/app.xml fallback is not discarded
        if ap_name:
            try:
                raw = self.read(ap_name)
            except KeyError:
                pass
            else:
                read_app_props(raw, mi)
        return mi
    def close(self):
        if hasattr(self, 'zipf'):
            self.zipf.close()
        else:
            try:
                shutil.rmtree(self.tdir)
            except EnvironmentError:
                pass
 if __name__ == '__main__':
    d = DOCX(sys.argv[-1], extract=False)
    print(d.metadata)
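The path handling in `DOCX.get_relationships` is easy to misread, so here is a small standalone sketch of the two string computations it performs. `rels_part_name` and `resolve_target` are hypothetical names introduced only for illustration; the logic is copied from the method above and no ZIP access is involved.

```python
def rels_part_name(name):
    # 'word/document.xml' keeps its relationships in
    # 'word/_rels/document.xml.rels'
    parts = name.split('/')
    return '/'.join(parts[:-1] + ['_rels', parts[-1] + '.rels'])


def resolve_target(name, target, external=False):
    # Internal targets are relative to the directory of the part that
    # owns the relationship; external URLs and '#fragment' links are
    # returned unchanged.
    base = '/'.join(name.split('/')[:-1])
    if external or target.startswith('#'):
        return target
    return '/'.join((base, target.lstrip('/')))


print(rels_part_name('word/document.xml'))
print(resolve_target('word/document.xml', 'media/image1.png'))
print(resolve_target('word/document.xml', 'https://example.com/', external=True))
```

The first two calls print `word/_rels/document.xml.rels` and `word/media/image1.png`; the third returns the URL untouched, mirroring the `TargetMode="External"` branch.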
--- a/ebook_converter/ebooks/docx/fields.py
+++ b/ebook_converter/ebooks/docx/fields.py
@@ -0,0 +1,276 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import re
 from calibre.ebooks.docx.index import process_index, polish_index_markup
 from polyglot.builtins import iteritems, native_string_type
 class Field(object):
    def __init__(self, start):
        self.start = start
        self.end = None
        self.contents = []
        self.buf = []
        self.instructions = None
        self.name = None
    def add_instr(self, elem):
        self.add_raw(elem.text)
    def add_raw(self, raw):
        if not raw:
            return
        if self.name is None:
            # There are cases where partial index entries end with
            # a significant space, along the lines of
            # <>Summary <>  ...  <>Hearing<>.
            # No known examples of starting with a space yet.
            # self.name, raw = raw.strip().partition(' ')[0::2]
            self.name, raw = raw.lstrip().partition(' ')[0::2]
        self.buf.append(raw)
    def finalize(self):
        self.instructions = ''.join(self.buf)
        del self.buf
 WORD, FLAG = 0, 1
 scanner = re.Scanner([
    (r'\\\S{1}', lambda s, t: (t, FLAG)),  # A flag of the form \x
    (r'"[^"]*"', lambda s, t: (t[1:-1], WORD)),  # Quoted word
    (r'[^\s\\"]\S*', lambda s, t: (t, WORD)),  # A non-quoted word, must not start with a backslash or a space or a quote
    (r'\s+', None),
 ], flags=re.DOTALL)
 null = object()
 def parser(name, field_map, default_field_name=None):
    field_map = dict((x.split(':') for x in field_map.split()))
    def parse(raw, log=None):
        ans = {}
        last_option = None
        raw = raw.replace('\\\\', '\x01').replace('\\"', '\x02')
        for token, token_type in scanner.scan(raw)[0]:
            token = token.replace('\x01', '\\').replace('\x02', '"')
            if token_type is FLAG:
                last_option = field_map.get(token[1], null)
                if last_option is not None:
                    ans[last_option] = None
            elif token_type is WORD:
                if last_option is None:
                    ans[default_field_name] = token
                else:
                    ans[last_option] = token
                    last_option = None
        ans.pop(null, None)
        return ans
    parse.__name__ = native_string_type('parse_' + name)
    return parse
 parse_hyperlink = parser('hyperlink',
    'l:anchor m:image-map n:target o:title t:target', 'url')
 parse_xe = parser('xe',
    'b:bold i:italic f:entry-type r:page-range-bookmark t:page-number-text y:yomi', 'text')
 parse_index = parser('index',
    'b:bookmark c:columns-per-page d:sequence-separator e:first-page-number-separator'
    ' f:entry-type g:page-range-separator h:heading k:crossref-separator'
    ' l:page-number-separator p:letter-range s:sequence-name r:run-together y:yomi z:langcode')
 parse_ref = parser('ref',
    'd:separator f:footnote h:hyperlink n:number p:position r:relative-number t:suppress w:number-full-context')
 parse_noteref = parser('noteref',
                   'f:footnote h:hyperlink p:position')
 class Fields(object):
    def __init__(self, namespace):
        self.namespace = namespace
        self.fields = []
        self.index_bookmark_counter = 0
        self.index_bookmark_prefix = 'index-'
    def __call__(self, doc, log):
        all_ids = frozenset(self.namespace.XPath('//*/@w:id')(doc))
        c = 0
        while self.index_bookmark_prefix in all_ids:
            c += 1
            self.index_bookmark_prefix = self.index_bookmark_prefix.replace('-', '%d-' % c)
        stack = []
        for elem in self.namespace.XPath(
            '//*[name()="w:p" or name()="w:r" or'
            ' name()="w:instrText" or'
            ' (name()="w:fldChar" and (@w:fldCharType="begin" or @w:fldCharType="end") or'
            ' name()="w:fldSimple")]')(doc):
            if elem.tag.endswith('}fldChar'):
                typ = self.namespace.get(elem, 'w:fldCharType')
                if typ == 'begin':
                    stack.append(Field(elem))
                    self.fields.append(stack[-1])
                else:
                    try:
                        stack.pop().end = elem
                    except IndexError:
                        pass
            elif elem.tag.endswith('}instrText'):
                if stack:
                    stack[-1].add_instr(elem)
            elif elem.tag.endswith('}fldSimple'):
                field = Field(elem)
                instr = self.namespace.get(elem, 'w:instr')
                if instr:
                    field.add_raw(instr)
                    self.fields.append(field)
                    for r in self.namespace.XPath('descendant::w:r')(elem):
                        field.contents.append(r)
            else:
                if stack:
                    stack[-1].contents.append(elem)
        field_types = ('hyperlink', 'xe', 'index', 'ref', 'noteref')
        parsers = {x.upper():getattr(self, 'parse_'+x) for x in field_types}
        parsers.update({x:getattr(self, 'parse_'+x) for x in field_types})
        field_parsers = {f.upper():globals()['parse_%s' % f] for f in field_types}
        field_parsers.update({f:globals()['parse_%s' % f] for f in field_types})
        for f in field_types:
            setattr(self, '%s_fields' % f, [])
        unknown_fields = {'TOC', 'toc', 'PAGEREF', 'pageref'}  # The TOC and PAGEREF fields are handled separately
        for field in self.fields:
            field.finalize()
            if field.instructions:
                func = parsers.get(field.name, None)
                if func is not None:
                    func(field, field_parsers[field.name], log)
                elif field.name not in unknown_fields:
                    log.warn('Encountered unknown field: %s, ignoring it.' % field.name)
                    unknown_fields.add(field.name)
    def get_runs(self, field):
        all_runs = []
        current_runs = []
        # We only handle spans in a single paragraph
        # being wrapped in <a>
        for x in field.contents:
            if x.tag.endswith('}p'):
                if current_runs:
                    all_runs.append(current_runs)
                current_runs = []
            elif x.tag.endswith('}r'):
                current_runs.append(x)
        if current_runs:
            all_runs.append(current_runs)
        return all_runs
    def parse_hyperlink(self, field, parse_func, log):
        # Parse hyperlink fields
        hl = parse_func(field.instructions, log)
        if hl:
            if 'target' in hl and hl['target'] is None:
                hl['target'] = '_blank'
            for runs in self.get_runs(field):
                self.hyperlink_fields.append((hl, runs))
    def parse_ref(self, field, parse_func, log):
        ref = parse_func(field.instructions, log)
        dest = ref.get(None, None)
        if dest is not None and 'hyperlink' in ref:
            for runs in self.get_runs(field):
                self.hyperlink_fields.append(({'anchor':dest}, runs))
        else:
            log.warn('Unsupported reference field (%s), ignoring: %r' % (field.name, ref))
    parse_noteref = parse_ref
    def parse_xe(self, field, parse_func, log):
        # Parse XE fields
        if None in (field.start, field.end):
            return
        xe = parse_func(field.instructions, log)
        if xe:
            # We insert a synthetic bookmark around this index item so that we
            # can link to it later
            def WORD(x):
                return self.namespace.expand('w:' + x)
            self.index_bookmark_counter += 1
            bmark = xe['anchor'] = '%s%d' % (self.index_bookmark_prefix, self.index_bookmark_counter)
            p = field.start.getparent()
            bm = p.makeelement(WORD('bookmarkStart'))
            bm.set(WORD('id'), bmark), bm.set(WORD('name'), bmark)
            p.insert(p.index(field.start), bm)
            p = field.end.getparent()
            bm = p.makeelement(WORD('bookmarkEnd'))
            bm.set(WORD('id'), bmark)
            p.insert(p.index(field.end) + 1, bm)
            xe['start_elem'] = field.start
            self.xe_fields.append(xe)
    def parse_index(self, field, parse_func, log):
        if not field.contents:
            return
        idx = parse_func(field.instructions, log)
        hyperlinks, blocks = process_index(field, idx, self.xe_fields, log, self.namespace.XPath, self.namespace.expand)
        if not blocks:
            return
        for anchor, run in hyperlinks:
            self.hyperlink_fields.append(({'anchor':anchor}, [run]))
        self.index_fields.append((idx, blocks))
    def polish_markup(self, object_map):
        if not self.index_fields:
            return
        rmap = {v:k for k, v in iteritems(object_map)}
        for idx, blocks in self.index_fields:
            polish_index_markup(idx, [rmap[b] for b in blocks])
 def test_parse_fields(return_tests=False):
    import unittest
    class TestParseFields(unittest.TestCase):
        def test_hyperlink(self):
            ae = lambda x, y: self.assertEqual(parse_hyperlink(x, None), y)
            ae(r'\l anchor1', {'anchor':'anchor1'})
            ae(r'www.calibre-ebook.com', {'url':'www.calibre-ebook.com'})
            ae(r'www.calibre-ebook.com \t target \o tt', {'url':'www.calibre-ebook.com', 'target':'target', 'title': 'tt'})
            ae(r'"c:\\Some Folder"', {'url': 'c:\\Some Folder'})
            ae(r'xxxx \y yyyy', {'url': 'xxxx'})
        def test_xe(self):
            ae = lambda x, y: self.assertEqual(parse_xe(x, None), y)
            ae(r'"some name"', {'text':'some name'})
            ae(r'name \b \i', {'text':'name', 'bold':None, 'italic':None})
            ae(r'xxx \y a', {'text':'xxx', 'yomi':'a'})
        def test_index(self):
            ae = lambda x, y: self.assertEqual(parse_index(x, None), y)
            ae(r'', {})
            ae(r'\b \c 1', {'bookmark':None, 'columns-per-page': '1'})
    suite = unittest.TestLoader().loadTestsFromTestCase(TestParseFields)
    if return_tests:
        return suite
    unittest.TextTestRunner(verbosity=4).run(suite)
 if __name__ == '__main__':
    test_parse_fields()
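To see what the field-instruction scanner produces before `parser()` maps flags to option names, the token stream can be inspected directly. This is a minimal sketch that reuses the scanner table from fields.py; `tokenize` is a hypothetical helper, and `re.Scanner` is an undocumented part of the standard `re` module.

```python
import re

WORD, FLAG = 0, 1
scanner = re.Scanner([
    (r'\\\S{1}', lambda s, t: (t, FLAG)),        # a flag of the form \x
    (r'"[^"]*"', lambda s, t: (t[1:-1], WORD)),  # quoted word, quotes stripped
    (r'[^\s\\"]\S*', lambda s, t: (t, WORD)),    # bare word
    (r'\s+', None),                              # whitespace is skipped
], flags=re.DOTALL)


def tokenize(raw):
    # Escaped backslashes and quotes are hidden behind placeholder bytes
    # while scanning, then restored in each token.
    raw = raw.replace('\\\\', '\x01').replace('\\"', '\x02')
    tokens, _remainder = scanner.scan(raw)
    return [(tok.replace('\x01', '\\').replace('\x02', '"'), typ)
            for tok, typ in tokens]


print(tokenize(r'www.calibre-ebook.com \t target \o tt'))
print(tokenize(r'"c:\\Some Folder"'))
```

Each flag token keeps its leading backslash, which is why `parse()` looks up `token[1]` in the field map.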
--- a/ebook_converter/ebooks/docx/fonts.py
+++ b/ebook_converter/ebooks/docx/fonts.py
@@ -0,0 +1,197 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import os, re
 from collections import namedtuple
 from calibre.ebooks.docx.block_styles import binary_property, inherit
 from calibre.utils.filenames import ascii_filename
 from calibre.utils.fonts.scanner import font_scanner, NoFonts
 from calibre.utils.fonts.utils import panose_to_css_generic_family, is_truetype_font
 from calibre.utils.icu import ord_string
 from polyglot.builtins import codepoint_to_chr, iteritems, range
 Embed = namedtuple('Embed', 'name key subsetted')
 def has_system_fonts(name):
    try:
        return bool(font_scanner.fonts_for_family(name))
    except NoFonts:
        return False
 def get_variant(bold=False, italic=False):
    return {(False, False):'Regular', (False, True):'Italic',
            (True, False):'Bold', (True, True):'BoldItalic'}[(bold, italic)]
 def find_fonts_matching(fonts, style='normal', stretch='normal'):
    for font in fonts:
        if font['font-style'] == style and font['font-stretch'] == stretch:
            yield font
 def weight_key(font):
    w = font['font-weight']
    try:
        return abs(int(w) - 400)
    except Exception:
        return abs({'normal': 400, 'bold': 700}.get(w, 1000000) - 400)
 def get_best_font(fonts, style, stretch):
    try:
        return sorted(find_fonts_matching(fonts, style, stretch), key=weight_key)[0]
    except Exception:
        pass
 class Family(object):
    def __init__(self, elem, embed_relationships, XPath, get):
        self.name = self.family_name = get(elem, 'w:name')
        self.alt_names = tuple(get(x, 'w:val') for x in XPath('./w:altName')(elem))
        if self.alt_names and not has_system_fonts(self.name):
            for x in self.alt_names:
                if has_system_fonts(x):
                    self.family_name = x
                    break
        self.embedded = {}
        for x in ('Regular', 'Bold', 'Italic', 'BoldItalic'):
            for y in XPath('./w:embed%s[@r:id]' % x)(elem):
                rid = get(y, 'r:id')
                key = get(y, 'w:fontKey')
                subsetted = get(y, 'w:subsetted') in {'1', 'true', 'on'}
                if rid in embed_relationships:
                    self.embedded[x] = Embed(embed_relationships[rid], key, subsetted)
        self.generic_family = 'auto'
        for x in XPath('./w:family[@w:val]')(elem):
            self.generic_family = get(x, 'w:val', 'auto')
        ntt = binary_property(elem, 'notTrueType', XPath, get)
        self.is_ttf = ntt is inherit or not ntt
        self.panose1 = None
        self.panose_name = None
        for x in XPath('./w:panose1[@w:val]')(elem):
            try:
                v = get(x, 'w:val')
                v = tuple(int(v[i:i+2], 16) for i in range(0, len(v), 2))
            except (TypeError, ValueError, IndexError):
                pass
            else:
                self.panose1 = v
                self.panose_name = panose_to_css_generic_family(v)
        self.css_generic_family = {'roman':'serif', 'swiss':'sans-serif', 'modern':'monospace',
                                   'decorative':'fantasy', 'script':'cursive'}.get(self.generic_family, None)
        self.css_generic_family = self.css_generic_family or self.panose_name or 'serif'
 SYMBOL_MAPS = {  # {{{
    'Wingdings': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '🖉', '✂', '✁', '👓', '🕭', '🕮', '🕯', '🕿', '✆', '🖂', '🖃', '📪', '📫', '📬', '📭', '🗀', '🗁', '🗎', '🗏', '🗐', '🗄', '⏳', '🖮', '🖰', '🖲', '🖳', '🖴', '🖫', '🖬', '✇', '✍', '🖎', '✌', '🖏', '👍', '👎', '☜', '☞', '☜', '🖗', '🖐', '☺', '😐', '☹', '💣', '🕱', '🏳', '🏱', '✈', '☼', '🌢', '❄', '🕆', '✞', '🕈', '✠', '✡', '☪', '☯', '🕉', '☸', '♈', '♉', '♊', '♋', '♌', '♍', '♎', '♏', '♐', '♑', '♒', '♓', '🙰', '🙵', '⚫', '🔾', '◼', '🞏', '🞐', '❑', '❒', '🞟', '⧫', '◆', '❖', '🞙', '⌧', '⮹', '⌘', '🏵', '🏶', '🙶', '🙷', ' ', '🄋', '➀', '➁', '➂', '➃', '➄', '➅', '➆', '➇', '➈', '➉', '🄌', '➊', '➋', '➌', '➍', '➎', '➏', '➐', '➑', '➒', '➓', '🙢', '🙠', '🙡', '🙣', '🙦', '🙤', '🙥', '🙧', '∙', '•', '⬝', '⭘', '🞆', '🞈', '🞊', '🞋', '🔿', '▪', '🞎', '🟀', '🟁', '★', '🟋', '🟏', '🟓', '🟑', '⯐', '⌖', '⯎', '⯏', '⯑', '✪', '✰', '🕐', '🕑', '🕒', '🕓', '🕔', '🕕', '🕖', '🕗', '🕘', '🕙', '🕚', '🕛', '⮰', '⮱', '⮲', '⮳', '⮴', '⮵', '⮶', '⮷', '🙪', '🙫', '🙕', '🙔', '🙗', '🙖', '🙐', '🙑', '🙒', '🙓', '⌫', '⌦', '⮘', '⮚', '⮙', '⮛', '⮈', '⮊', '⮉', '⮋', '🡨', '🡪', '🡩', '🡫', '🡬', '🡭', '🡯', '🡮', '🡸', '🡺', '🡹', '🡻', '🡼', '🡽', '🡿', '🡾', '⇦', '⇨', '⇧', '⇩', '⬄', '⇳', '⬁', '⬀', '⬃', '⬂', '🢬', '🢭', '🗶', '✓', '🗷', '🗹', ' '),  # noqa
    'Wingdings 2': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '🖊', '🖋', '🖌', '🖍', '✄', '✀', '🕾', '🕽', '🗅', '🗆', '🗇', '🗈', '🗉', '🗊', '🗋', '🗌', '🗍', '📋', '🗑', '🗔', '🖵', '🖶', '🖷', '🖸', '🖭', '🖯', '🖱', '🖒', '🖓', '🖘', '🖙', '🖚', '🖛', '👈', '👉', '🖜', '🖝', '🖞', '🖟', '🖠', '🖡', '👆', '👇', '🖢', '🖣', '🖑', '🗴', '🗸', '🗵', '☑', '⮽', '☒', '⮾', '⮿', '🛇', '⦸', '🙱', '🙴', '🙲', '🙳', '‽', '🙹', '🙺', '🙻', '🙦', '🙤', '🙥', '🙧', '🙚', '🙘', '🙙', '🙛', '⓪', '①', '②', '③', '④', '⑤', '⑥', '⑦', '⑧', '⑨', '⑩', '⓿', '❶', '❷', '❸', '❹', '❺', '❻', '❼', '❽', '❾', '❿', ' ', '☉', '🌕', '☽', '☾', '⸿', '✝', '🕇', '🕜', '🕝', '🕞', '🕟', '🕠', '🕡', '🕢', '🕣', '🕤', '🕥', '🕦', '🕧', '🙨', '🙩', '⋅', '🞄', '⦁', '●', '●', '🞅', '🞇', '🞉', '⊙', '⦿', '🞌', '🞍', '◾', '■', '□', '🞑', '🞒', '🞓', '🞔', '▣', '🞕', '🞖', '🞗', '🞘', '⬩', '⬥', '◇', '🞚', '◈', '🞛', '🞜', '🞝', '🞞', '⬪', '⬧', '◊', '🞠', '◖', '◗', '⯊', '⯋', '⯀', '⯁', '⬟', '⯂', '⬣', '⬢', '⯃', '⯄', '🞡', '🞢', '🞣', '🞤', '🞥', '🞦', '🞧', '🞨', '🞩', '🞪', '🞫', '🞬', '🞭', '🞮', '🞯', '🞰', '🞱', '🞲', '🞳', '🞴', '🞵', '🞶', '🞷', '🞸', '🞹', '🞺', '🞻', '🞼', '🞽', '🞾', '🞿', '🟀', '🟂', '🟄', '🟆', '🟉', '🟊', '✶', '🟌', '🟎', '🟐', '🟒', '✹', '🟃', '🟇', '✯', '🟍', '🟔', '⯌', '⯍', '※', '⁂', ' ', ' ', ' ', ' ', ' ', ' ',),  # noqa
    'Wingdings 3': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '⭠', '⭢', '⭡', '⭣', '⭤', '⭥', '⭧', '⭦', '⭰', '⭲', '⭱', '⭳', '⭶', '⭸', '⭻', '⭽', '⭤', '⭥', '⭪', '⭬', '⭫', '⭭', '⭍', '⮠', '⮡', '⮢', '⮣', '⮤', '⮥', '⮦', '⮧', '⮐', '⮑', '⮒', '⮓', '⮀', '⮃', '⭾', '⭿', '⮄', '⮆', '⮅', '⮇', '⮏', '⮍', '⮎', '⮌', '⭮', '⭯', '⎋', '⌤', '⌃', '⌥', '␣', '⍽', '⇪', '⮸', '🢠', '🢡', '🢢', '🢣', '🢤', '🢥', '🢦', '🢧', '🢨', '🢩', '🢪', '🢫', '🡐', '🡒', '🡑', '🡓', '🡔', '🡕', '🡗', '🡖', '🡘', '🡙', '▲', '▼', '△', '▽', '◀', '▶', '◁', '▷', '◣', '◢', '◤', '◥', '🞀', '🞂', '🞁', ' ', '🞃', '⯅', '⯆', '⯇', '⯈', '⮜', '⮞', '⮝', '⮟', '🠐', '🠒', '🠑', '🠓', '🠔', '🠖', '🠕', '🠗', '🠘', '🠚', '🠙', '🠛', '🠜', '🠞', '🠝', '🠟', '🠀', '🠂', '🠁', '🠃', '🠄', '🠆', '🠅', '🠇', '🠈', '🠊', '🠉', '🠋', '🠠', '🠢', '🠤', '🠦', '🠨', '🠪', '🠬', '🢜', '🢝', '🢞', '🢟', '🠮', '🠰', '🠲', '🠴', '🠶', '🠸', '🠺', '🠹', '🠻', '🢘', '🢚', '🢙', '🢛', '🠼', '🠾', '🠽', '🠿', '🡀', '🡂', '🡁', '🡃', '🡄', '🡆', '🡅', '🡇', '⮨', '⮩', '⮪', '⮫', '⮬', '⮭', '⮮', '⮯', '🡠', '🡢', '🡡', '🡣', '🡤', '🡥', '🡧', '🡦', '🡰', '🡲', '🡱', '🡳', '🡴', '🡵', '🡷', '🡶', '🢀', '🢂', '🢁', '🢃', '🢄', '🢅', '🢇', '🢆', '🢐', '🢒', '🢑', '🢓', '🢔', '🢕', '🢗', '🢖', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',),  # noqa
    'Webdings': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '🕷', '🕸', '🕲', '🕶', '🏆', '🎖', '🖇', '🗨', '🗩', '🗰', '🗱', '🌶', '🎗', '🙾', '🙼', '🗕', '🗖', '🗗', '⏴', '⏵', '⏶', '⏷', '⏪', '⏩', '⏮', '⏭', '⏸', '⏹', '⏺', '🗚', '🗳', '🛠', '🏗', '🏘', '🏙', '🏚', '🏜', '🏭', '🏛', '🏠', '🏖', '🏝', '🛣', '🔍', '🏔', '👁', '👂', '🏞', '🏕', '🛤', '🏟', '🛳', '🕬', '🕫', '🕨', '🔈', '🎔', '🎕', '🗬', '🙽', '🗭', '🗪', '🗫', '⮔', '✔', '🚲', '⬜', '🛡', '📦', '🛱', '⬛', '🚑', '🛈', '🛩', '🛰', '🟈', '🕴', '⬤', '🛥', '🚔', '🗘', '🗙', '❓', '🛲', '🚇', '🚍', '⛳', '⦸', '⊖', '🚭', '🗮', '⏐', '🗯', '🗲', ' ', '🚹', '🚺', '🛉', '🛊', '🚼', '👽', '🏋', '⛷', '🏂', '🏌', '🏊', '🏄', '🏍', '🏎', '🚘', '🗠', '🛢', '📠', '🏷', '📣', '👪', '🗡', '🗢', '🗣', '✯', '🖄', '🖅', '🖃', '🖆', '🖹', '🖺', '🖻', '🕵', '🕰', '🖽', '🖾', '📋', '🗒', '🗓', '🕮', '📚', '🗞', '🗟', '🗃', '🗂', '🖼', '🎭', '🎜', '🎘', '🎙', '🎧', '💿', '🎞', '📷', '🎟', '🎬', '📽', '📹', '📾', '📻', '🎚', '🎛', '📺', '💻', '🖥', '🖦', '🖧', '🍹', '🎮', '🎮', '🕻', '🕼', '🖁', '🖀', '🖨', '🖩', '🖿', '🖪', '🗜', '🔒', '🔓', '🗝', '📥', '📤', '🕳', '🌣', '🌤', '🌥', '🌦', '☁', '🌨', '🌧', '🌩', '🌪', '🌬', '🌫', '🌜', '🌡', '🛋', '🛏', '🍽', '🍸', '🛎', '🛍', 'Ⓟ', '♿', '🛆', '🖈', '🎓', '🗤', '🗥', '🗦', '🗧', '🛪', '🐿', '🐦', '🐟', '🐕', '🐈', '🙬', '🙮', '🙭', '🙯', '🗺', '🌍', '🌏', '🌎', '🕊',),  # noqa
    'Symbol': (' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!', '∀', '#', '∃', '%', '&', '∍', '(', ')', '*', '+', ',', '−', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '≅', 'Α', 'Β', 'Χ', 'Δ', 'Ε', 'Φ', 'Γ', 'Η', 'Ι', 'ϑ', 'Λ', 'Μ', 'Ν', 'Ξ', 'Ο', 'Π', 'Θ', 'Ρ', 'Σ', 'Τ', 'Υ', 'ς', 'Ω', 'Ξ', 'Ψ', 'Ζ', '[', '∴', ']', '⊥', '_', '', 'α', 'β', 'χ', 'δ', 'ε', 'φ', 'γ', 'η', 'ι', 'ϕ', 'λ', 'μ', 'ν', 'ξ', 'ο', 'π', 'θ', 'ρ', 'σ', 'τ', 'υ', 'ϖ', 'ω', 'ξ', 'ψ', 'ζ', '{', '|', '}', '~', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '€', 'ϒ', '′', '≤', '⁄', '∞', 'ƒ', '♣', '♥', '♦', '♠', '↔', '←', '↑', '→', '↓', '°', '±', '″', '≥', '×', '∝', '∂', '•', '÷', '≠', '≡', '≈', '…', '⏐', '⎯', '↲', 'ℵ', 'ℑ', 'ℜ', '℘', '⊗', '⊕', '∅', '∩', '∪', '⊃', '⊇', '⊄', '⊂', '⊆', '∈', '∉', '∠', '∂', '®', '©', '™', '∏', '√', '⋅', '¬', '∦', '∧', '⇔', '⇐', '⇑', '⇒', '⇓', '◊', '〈', '®', '©', '™', '∑', '⎛', '⎜', '⎝', '⎡', '⎢', '⎣', '⎧', '⎨', '⎩', '⎪', ' ', '〉', '∫', '⌠', '⎮', '⌡', '⎞', '⎟', '⎠', '⎤', '⎥', '⎦', '⎪', '⎫', '⎬', ' ',),  # noqa
 }  # }}}
 SYMBOL_FONT_NAMES = frozenset(n.lower() for n in SYMBOL_MAPS)
 def is_symbol_font(family):
    try:
        return family.lower() in SYMBOL_FONT_NAMES
    except AttributeError:
        return False
 def do_map(m, points):
    base = 0xf000
    limit = len(m) + base
    for p in points:
        if base < p < limit:
            yield m[p - base]
        else:
            yield codepoint_to_chr(p)
 def map_symbol_text(text, font):
    m = SYMBOL_MAPS[font]
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    return ''.join(do_map(m, ord_string(text)))
 class Fonts(object):
    def __init__(self, namespace):
        self.namespace = namespace
        self.fonts = {}
        self.used = set()
    def __call__(self, root, embed_relationships, docx, dest_dir):
        for elem in self.namespace.XPath('//w:font[@w:name]')(root):
            self.fonts[self.namespace.get(elem, 'w:name')] = Family(elem, embed_relationships, self.namespace.XPath, self.namespace.get)
    def family_for(self, name, bold=False, italic=False):
        f = self.fonts.get(name, None)
        if f is None:
            return 'serif'
        variant = get_variant(bold, italic)
        self.used.add((name, variant))
        name = f.name if variant in f.embedded else f.family_name
        if is_symbol_font(name):
            return name
        return '"%s", %s' % (name.replace('"', ''), f.css_generic_family)
    def embed_fonts(self, dest_dir, docx):
        defs = []
        dest_dir = os.path.join(dest_dir, 'fonts')
        for name, variant in self.used:
            f = self.fonts[name]
            if variant in f.embedded:
                if not os.path.exists(dest_dir):
                    os.mkdir(dest_dir)
                fname = self.write(name, dest_dir, docx, variant)
                if fname is not None:
                    d = {'font-family':'"%s"' % name.replace('"', ''), 'src': 'url("fonts/%s")' % fname}
                    if 'Bold' in variant:
                        d['font-weight'] = 'bold'
                    if 'Italic' in variant:
                        d['font-style'] = 'italic'
                    d = ['%s: %s' % (k, v) for k, v in iteritems(d)]
                    d = ';\n\t'.join(d)
                    defs.append('@font-face {\n\t%s\n}\n' % d)
        return '\n'.join(defs)
    def write(self, name, dest_dir, docx, variant):
        f = self.fonts[name]
        ef = f.embedded[variant]
        raw = docx.read(ef.name)
        prefix = raw[:32]
        if ef.key:
            key = re.sub(r'[^A-Fa-f0-9]', '', ef.key)
            key = bytearray(reversed(tuple(int(key[i:i+2], 16) for i in range(0, len(key), 2))))
            prefix = bytearray(prefix)
            prefix = bytes(bytearray(prefix[i]^key[i % len(key)] for i in range(len(prefix))))
        if not is_truetype_font(prefix):
            return None
        ext = 'otf' if prefix.startswith(b'OTTO') else 'ttf'
        fname = ascii_filename('%s - %s.%s' % (name, variant, ext))
        with open(os.path.join(dest_dir, fname), 'wb') as dest:
            dest.write(prefix)
            dest.write(raw[32:])
        return fname
--- a/ebook_converter/ebooks/docx/footnotes.py
+++ b/ebook_converter/ebooks/docx/footnotes.py
@@ -0,0 +1,65 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 from collections import OrderedDict
 from polyglot.builtins import iteritems, unicode_type
 class Note(object):
    def __init__(self, namespace, parent, rels):
        self.type = namespace.get(parent, 'w:type', 'normal')
        self.parent = parent
        self.rels = rels
        self.namespace = namespace
    def __iter__(self):
        for p in self.namespace.descendants(self.parent, 'w:p', 'w:tbl'):
            yield p
 class Footnotes(object):
    def __init__(self, namespace):
        self.namespace = namespace
        self.footnotes = {}
        self.endnotes = {}
        self.counter = 0
        self.notes = OrderedDict()
    def __call__(self, footnotes, footnotes_rels, endnotes, endnotes_rels):
        XPath, get = self.namespace.XPath, self.namespace.get
        if footnotes is not None:
            for footnote in XPath('./w:footnote[@w:id]')(footnotes):
                fid = get(footnote, 'w:id')
                if fid:
                    self.footnotes[fid] = Note(self.namespace, footnote, footnotes_rels)
        if endnotes is not None:
            for endnote in XPath('./w:endnote[@w:id]')(endnotes):
                fid = get(endnote, 'w:id')
                if fid:
                    self.endnotes[fid] = Note(self.namespace, endnote, endnotes_rels)
    def get_ref(self, ref):
        fid = self.namespace.get(ref, 'w:id')
        notes = self.footnotes if ref.tag.endswith('}footnoteReference') else self.endnotes
        note = notes.get(fid, None)
        if note is not None and note.type == 'normal':
            self.counter += 1
            anchor = 'note_%d' % self.counter
            self.notes[anchor] = (unicode_type(self.counter), note)
            return anchor, unicode_type(self.counter)
        return None, None
    def __iter__(self):
        for anchor, (counter, note) in iteritems(self.notes):
            yield anchor, counter, note
    @property
    def has_notes(self):
        return bool(self.notes)
--- a/ebook_converter/ebooks/docx/images.py
+++ b/ebook_converter/ebooks/docx/images.py
@@ -0,0 +1,343 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import os
 from lxml.html.builder import IMG, HR
 from calibre.constants import iswindows
 from calibre.ebooks.docx.names import barename
 from calibre.utils.filenames import ascii_filename
 from calibre.utils.img import resize_to_fit, image_to_data
 from calibre.utils.imghdr import what
 from polyglot.builtins import iteritems, itervalues
 class LinkedImageNotFound(ValueError):
    def __init__(self, fname):
        ValueError.__init__(self, fname)
        self.fname = fname
 def image_filename(x):
    return ascii_filename(x).replace(' ', '_').replace('#', '_')
 def emu_to_pt(x):
    return x / 12700
 def pt_to_emu(x):
    return int(x * 12700)
 def get_image_properties(parent, XPath, get):
    width = height = None
    for extent in XPath('./wp:extent')(parent):
        try:
            width = emu_to_pt(int(extent.get('cx')))
        except (TypeError, ValueError):
            pass
        try:
            height = emu_to_pt(int(extent.get('cy')))
        except (TypeError, ValueError):
            pass
    ans = {}
    if width is not None:
        ans['width'] = '%.3gpt' % width
    if height is not None:
        ans['height'] = '%.3gpt' % height
    alt = None
    title = None
    for docPr in XPath('./wp:docPr')(parent):
        alt = docPr.get('descr') or alt
        title = docPr.get('title') or title
        if docPr.get('hidden', None) in {'true', 'on', '1'}:
            ans['display'] = 'none'
    return ans, alt, title
 def get_image_margins(elem):
    ans = {}
    for w, css in iteritems({'L':'left', 'T':'top', 'R':'right', 'B':'bottom'}):
        val = elem.get('dist%s' % w, None)
        if val is not None:
            try:
                val = emu_to_pt(int(val))
            except (TypeError, ValueError):
                continue
            ans['padding-%s' % css] = '%.3gpt' % val
    return ans
 def get_hpos(anchor, page_width, XPath, get, width_frac):
    for ph in XPath('./wp:positionH')(anchor):
        rp = ph.get('relativeFrom', None)
        if rp == 'leftMargin':
            return 0 + width_frac
        if rp == 'rightMargin':
            return 1 + width_frac
        al = None
        almap = {'left':0, 'center':0.5, 'right':1}
        for align in XPath('./wp:align')(ph):
            al = almap.get(align.text)
            if al is not None:
                if rp == 'page':
                    return al
                return al + width_frac
        for po in XPath('./wp:posOffset')(ph):
            try:
                pos = emu_to_pt(int(po.text))
            except (TypeError, ValueError):
                continue
            return pos/page_width + width_frac
    for sp in XPath('./wp:simplePos')(anchor):
        try:
            x = emu_to_pt(int(sp.get('x', None)))
        except (TypeError, ValueError):
            continue
        return x/page_width + width_frac
    return 0
 class Images(object):
    def __init__(self, namespace, log):
        self.namespace = namespace
        self.rid_map = {}
        self.used = {}
        self.resized = {}
        self.names = set()
        self.all_images = set()
        self.links = []
        self.log = log
    def __call__(self, relationships_by_id):
        self.rid_map = relationships_by_id
    def read_image_data(self, fname, base=None):
        if fname.startswith('file://'):
            src = fname[len('file://'):]
            if iswindows and src and src[0] == '/':
                src = src[1:]
            if not src or not os.path.exists(src):
                raise LinkedImageNotFound(src)
            with open(src, 'rb') as rawsrc:
                raw = rawsrc.read()
        else:
            try:
                raw = self.docx.read(fname)
            except KeyError:
                raise LinkedImageNotFound(fname)
        base = base or image_filename(fname.rpartition('/')[-1]) or 'image'
        ext = what(None, raw) or base.rpartition('.')[-1] or 'jpeg'
        if ext == 'emf':
            # For an example, see: https://bugs.launchpad.net/bugs/1224849
            self.log('Found an EMF image: %s, trying to extract embedded raster image' % fname)
            from calibre.utils.wmf.emf import emf_unwrap
            try:
                raw = emf_unwrap(raw)
            except Exception:
                self.log.exception('Failed to extract embedded raster image from EMF')
            else:
                ext = 'png'
        base = base.rpartition('.')[0]
        if not base:
            base = 'image'
        base += '.' + ext
        return raw, base
    def unique_name(self, base):
        exists = frozenset(itervalues(self.used))
        c = 1
        name = base
        while name in exists:
            n, e = base.rpartition('.')[0::2]
            name = '%s-%d.%s' % (n, c, e)
            c += 1
        return name
    def resize_image(self, raw, base, max_width, max_height):
        resized, img = resize_to_fit(raw, max_width, max_height)
        if resized:
            base, ext = os.path.splitext(base)
            base = base + '-%dx%d%s' % (max_width, max_height, ext)
            raw = image_to_data(img, fmt=ext[1:])
        return raw, base, resized
    def generate_filename(self, rid, base=None, rid_map=None, max_width=None, max_height=None):
        rid_map = self.rid_map if rid_map is None else rid_map
        fname = rid_map[rid]
        key = (fname, max_width, max_height)
        ans = self.used.get(key)
        if ans is not None:
            return ans
        raw, base = self.read_image_data(fname, base=base)
        resized = False
        if max_width is not None and max_height is not None:
            raw, base, resized = self.resize_image(raw, base, max_width, max_height)
        name = self.unique_name(base)
        self.used[key] = name
        if max_width is not None and max_height is not None and not resized:
            okey = (fname, None, None)
            if okey in self.used:
                return self.used[okey]
            self.used[okey] = name
        with open(os.path.join(self.dest_dir, name), 'wb') as f:
            f.write(raw)
        self.all_images.add('images/' + name)
        return name
    def pic_to_img(self, pic, alt, parent, title):
        XPath, get = self.namespace.XPath, self.namespace.get
        name = None
        link = None
        for hl in XPath('descendant::a:hlinkClick[@r:id]')(parent):
            link = {'id':get(hl, 'r:id')}
            tgt = hl.get('tgtFrame', None)
            if tgt:
                link['target'] = tgt
            title = hl.get('tooltip', None)
            if title:
                link['title'] = title
        for pr in XPath('descendant::pic:cNvPr')(pic):
            name = pr.get('name', None)
            if name:
                name = image_filename(name)
            alt = pr.get('descr') or alt
            for a in XPath('descendant::a:blip[@r:embed or @r:link]')(pic):
                rid = get(a, 'r:embed')
                if not rid:
                    rid = get(a, 'r:link')
                if rid and rid in self.rid_map:
                    try:
                        src = self.generate_filename(rid, name)
                    except LinkedImageNotFound as err:
                        self.log.warn('Linked image: %s not found, ignoring' % err.fname)
                        continue
                    img = IMG(src='images/%s' % src)
                    img.set('alt', alt or 'Image')
                    if title:
                        img.set('title', title)
                    if link is not None:
                        self.links.append((img, link, self.rid_map))
                    return img
    def drawing_to_html(self, drawing, page):
        XPath, get = self.namespace.XPath, self.namespace.get
        # First process the inline pictures
        for inline in XPath('./wp:inline')(drawing):
            style, alt, title = get_image_properties(inline, XPath, get)
            for pic in XPath('descendant::pic:pic')(inline):
                ans = self.pic_to_img(pic, alt, inline, title)
                if ans is not None:
                    if style:
                        ans.set('style', '; '.join('%s: %s' % (k, v) for k, v in iteritems(style)))
                    yield ans
        # Now process the floats
        for anchor in XPath('./wp:anchor')(drawing):
            style, alt, title = get_image_properties(anchor, XPath, get)
            self.get_float_properties(anchor, style, page)
            for pic in XPath('descendant::pic:pic')(anchor):
                ans = self.pic_to_img(pic, alt, anchor, title)
                if ans is not None:
                    if style:
                        ans.set('style', '; '.join('%s: %s' % (k, v) for k, v in iteritems(style)))
                    yield ans
    def pict_to_html(self, pict, page):
        XPath, get = self.namespace.XPath, self.namespace.get
        # First see if we have an <hr>
        is_hr = len(pict) == 1 and get(pict[0], 'o:hr') in {'t', 'true'}
        if is_hr:
            style = {}
            hr = HR()
            try:
                pct = float(get(pict[0], 'o:hrpct'))
            except (ValueError, TypeError, AttributeError):
                pass
            else:
                if pct > 0:
                    style['width'] = '%.3g%%' % pct
            align = get(pict[0], 'o:hralign', 'center')
            if align in {'left', 'right'}:
                style['margin-left'] = '0' if align == 'left' else 'auto'
                style['margin-right'] = 'auto' if align == 'left' else '0'
            if style:
                hr.set('style', '; '.join(('%s:%s' % (k, v) for k, v in iteritems(style))))
            yield hr
        for imagedata in XPath('descendant::v:imagedata[@r:id]')(pict):
            rid = get(imagedata, 'r:id')
            if rid in self.rid_map:
                try:
                    src = self.generate_filename(rid)
                except LinkedImageNotFound as err:
                    self.log.warn('Linked image: %s not found, ignoring' % err.fname)
                    continue
                img = IMG(src='images/%s' % src, style="display:block")
                alt = get(imagedata, 'o:title')
                img.set('alt', alt or 'Image')
                yield img
    def get_float_properties(self, anchor, style, page):
        XPath, get = self.namespace.XPath, self.namespace.get
        if 'display' not in style:
            style['display'] = 'block'
        padding = get_image_margins(anchor)
        width = float(style.get('width', '100pt')[:-2])
        page_width = page.width - page.margin_left - page.margin_right
        if page_width <= 0:
            # Ignore margins
            page_width = page.width
        hpos = get_hpos(anchor, page_width, XPath, get, width/(2*page_width))
        wrap_elem = None
        dofloat = False
        for child in reversed(anchor):
            bt = barename(child.tag)
            if bt in {'wrapNone', 'wrapSquare', 'wrapThrough', 'wrapTight', 'wrapTopAndBottom'}:
                wrap_elem = child
                dofloat = bt not in {'wrapNone', 'wrapTopAndBottom'}
                break
        if wrap_elem is not None:
            padding.update(get_image_margins(wrap_elem))
            wt = wrap_elem.get('wrapText', None)
            hpos = 0 if wt == 'right' else 1 if wt == 'left' else hpos
            if dofloat:
                style['float'] = 'left' if hpos < 0.65 else 'right'
            else:
                ml, mr = (None, None) if hpos < 0.34 else ('auto', None) if hpos > 0.65 else ('auto', 'auto')
                if ml is not None:
                    style['margin-left'] = ml
                if mr is not None:
                    style['margin-right'] = mr
        style.update(padding)
    def to_html(self, elem, page, docx, dest_dir):
        dest = os.path.join(dest_dir, 'images')
        if not os.path.exists(dest):
            os.mkdir(dest)
        self.dest_dir, self.docx = dest, docx
        if elem.tag.endswith('}drawing'):
            for tag in self.drawing_to_html(elem, page):
                yield tag
        else:
            for tag in self.pict_to_html(elem, page):
                yield tag
--- a/ebook_converter/ebooks/docx/index.py
+++ b/ebook_converter/ebooks/docx/index.py
@@ -0,0 +1,273 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2014, Kovid Goyal <kovid at kovidgoyal.net>'
 from operator import itemgetter
 from lxml import etree
 from calibre.utils.icu import partition_by_first_letter, sort_key
 from polyglot.builtins import iteritems, filter
 def get_applicable_xe_fields(index, xe_fields, XPath, expand):
    iet = index.get('entry-type', None)
    xe_fields = [xe for xe in xe_fields if xe.get('entry-type', None) == iet]
    lr = index.get('letter-range', None)
    if lr is not None:
        sl, el = lr.partition('-')[0::2]
        sl, el = sl.strip(), el.strip()
        if sl and el:
            def inrange(text):
                return sl <= text[0] <= el
            xe_fields = [xe for xe in xe_fields if inrange(xe.get('text', ''))]
    bmark = index.get('bookmark', None)
    if bmark is None:
        return xe_fields
    attr = expand('w:name')
    bookmarks = {b for b in XPath('//w:bookmarkStart')(xe_fields[0]['start_elem']) if b.get(attr, None) == bmark}
    ancestors = XPath('ancestor::w:bookmarkStart')
    def contained(xe):
        # Check if the xe field is contained inside a bookmark with the
        # specified name
        return bool(set(ancestors(xe['start_elem'])) & bookmarks)
    return [xe for xe in xe_fields if contained(xe)]
 def make_block(expand, style, parent, pos):
    p = parent.makeelement(expand('w:p'))
    parent.insert(pos, p)
    if style is not None:
        ppr = p.makeelement(expand('w:pPr'))
        p.append(ppr)
        ps = ppr.makeelement(expand('w:pStyle'))
        ppr.append(ps)
        ps.set(expand('w:val'), style)
    r = p.makeelement(expand('w:r'))
    p.append(r)
    t = r.makeelement(expand('w:t'))
    t.set(expand('xml:space'), 'preserve')
    r.append(t)
    return p, t
 def add_xe(xe, t, expand):
    run = t.getparent()
    idx = run.index(t)
    t.text = xe.get('text') or ' '
    pt = xe.get('page-number-text', None)
    if pt:
        p = t.getparent().getparent()
        r = p.makeelement(expand('w:r'))
        p.append(r)
        t2 = r.makeelement(expand('w:t'))
        t2.set(expand('xml:space'), 'preserve')
        t2.text = ' [%s]' % pt
        r.append(t2)
    # put separate entries on separate lines
    run.insert(idx + 1, run.makeelement(expand('w:br')))
    return xe['anchor'], run
 def process_index(field, index, xe_fields, log, XPath, expand):
    '''
    We remove all the Word-generated index markup and replace it with our own
    that is more suitable for an ebook.
    '''
    styles = []
    heading_text = index.get('heading', None)
    heading_style = 'IndexHeading'
    start_pos = None
    for elem in field.contents:
        if elem.tag.endswith('}p'):
            s = XPath('descendant::w:pStyle/@w:val')(elem)
            if s:
                styles.append(s[0])
            p = elem.getparent()
            if start_pos is None:
                start_pos = (p, p.index(elem))
            p.remove(elem)
    xe_fields = get_applicable_xe_fields(index, xe_fields, XPath, expand)
    if not xe_fields:
        return [], []
    if heading_text is not None:
        groups = partition_by_first_letter(xe_fields, key=itemgetter('text'))
        items = []
        for key, fields in iteritems(groups):
            items.append(key), items.extend(fields)
        if styles:
            heading_style = styles[0]
    else:
        items = sorted(xe_fields, key=lambda x:sort_key(x['text']))
    hyperlinks = []
    blocks = []
    for item in reversed(items):
        is_heading = not isinstance(item, dict)
        style = heading_style if is_heading else None
        p, t = make_block(expand, style, *start_pos)
        if is_heading:
            text = heading_text
            if text.lower().startswith('a'):
                text = item + text[1:]
            t.text = text
        else:
            hyperlinks.append(add_xe(item, t, expand))
            blocks.append(p)
    return hyperlinks, blocks
 def split_up_block(block, a, text, parts, ldict):
    prefix = parts[:-1]
    a.text = parts[-1]
    parent = a.getparent()
    style = 'display:block; margin-left: %.3gem'
    for i, part in enumerate(prefix):
        m = 1.5 * i
        span = parent.makeelement('span', style=style % m)
        ldict[span] = i
        parent.append(span)
        span.text = part
    span = parent.makeelement('span', style=style % (len(prefix) * 1.5))
    parent.append(span)
    span.append(a)
    ldict[span] = len(prefix)
 """
 The merge algorithm is a little tricky.
 We start with a list of elementary blocks. Each is an HtmlElement, a p node
 with a list of child nodes. The last child may be a link, and the earlier ones are
 just text.
 The list is in reverse order from what we want in the index.
 There is a dictionary ldict which records the level of each child node.
 Now we want to do a reduce-like operation, combining all blocks with the same
 top level index entry into a single block representing the structure of all
 references, subentries, etc. under that top entry.
 Here's the algorithm.
 Given a block p and the next block n, and the top level entries p1 and n1 in each
 block, which we assume have the same text:
 Start with (p, p1) and (n, n1).
 Given (p, p1, ..., pk) and (n, n1, ..., nk) which we want to merge:
 If there are no more levels in n, and we have a link in nk,
 then add the link from nk to the links for pk.
 This might be the first link for pk, or we might get a list of references.
 Otherwise nk+1 is the next level in n. Look for a matching entry in p. It must have
 the same text, it must follow pk, it must come before we find any other p entries at
 the same level as pk, and it must have the same level as nk+1.
 If we find such a matching entry, go back to the start with (p ... pk+1) and (n ... nk+1).
 If there is no matching entry, then because of the original reversed order we want
 to insert nk+1 and all following entries from n into p immediately following pk.
 """
 def find_match(prev_block, pind, nextent, ldict):
    curlevel = ldict.get(prev_block[pind], -1)
    if curlevel < 0:
        return -1
    for p in range(pind+1, len(prev_block)):
        trylev = ldict.get(prev_block[p], -1)
        if trylev <= curlevel:
            return -1
        if trylev > (curlevel+1):
            continue
        if prev_block[p].text_content() == nextent.text_content():
            return p
    return -1
 def add_link(pent, nent, ldict):
    na = nent.xpath('descendant::a[1]')
    # If there is no link, leave it as text
    if not na or len(na) == 0:
        return
    na = na[0]
    pa = pent.xpath('descendant::a')
    if pa and len(pa) > 0:
        # Put on same line with a comma
        pa = pa[-1]
        pa.tail = ', '
        p = pa.getparent()
        p.insert(p.index(pa) + 1, na)
    else:
        # substitute link na for plain text in pent
        pent.text = ""
        pent.append(na)
 def merge_blocks(prev_block, next_block, pind, nind, next_path, ldict):
    # First elements match. Any more in next?
    if len(next_path) == (nind + 1):
        nextent = next_block[nind]
        add_link(prev_block[pind], nextent, ldict)
        return
    nind = nind + 1
    nextent = next_block[nind]
    prevent = find_match(prev_block, pind, nextent, ldict)
    if prevent > 0:
        merge_blocks(prev_block, next_block, prevent, nind, next_path, ldict)
        return
    # Want to insert elements into previous block
    while nind < len(next_block):
        # insert takes it out of old
        pind = pind + 1
        prev_block.insert(pind, next_block[nind])
    next_block.getparent().remove(next_block)
 def polish_index_markup(index, blocks):
    # Blocks are in reverse order at this point
    path_map = {}
    ldict = {}
    for block in blocks:
        cls = block.get('class', '') or ''
        block.set('class', (cls + ' index-entry').lstrip())
        a = block.xpath('descendant::a[1]')
        text = ''
        if a:
            text = etree.tostring(a[0], method='text', with_tail=False, encoding='unicode').strip()
        if ':' in text:
            path_map[block] = parts = list(filter(None, (x.strip() for x in text.split(':'))))
            if len(parts) > 1:
                split_up_block(block, a[0], text, parts, ldict)
        else:
            # try using a span all the time
            path_map[block] = [text]
            parent = a[0].getparent()
            span = parent.makeelement('span', style='display:block; margin-left: 0em')
            parent.append(span)
            span.append(a[0])
            ldict[span] = 0
        for br in block.xpath('descendant::br'):
            br.tail = None
    # We want a single block for each main entry
    prev_block = blocks[0]
    for block in blocks[1:]:
        pp, pn = path_map[prev_block], path_map[block]
        if pp[0] == pn[0]:
            merge_blocks(prev_block, block, 0, 0, pn, ldict)
        else:
            prev_block = block
--- a/ebook_converter/ebooks/docx/names.py
+++ b/ebook_converter/ebooks/docx/names.py
@@ -0,0 +1,144 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import re
 from lxml.etree import XPath as X
 from calibre.utils.filenames import ascii_text
 from polyglot.builtins import iteritems
 # Names {{{
 TRANSITIONAL_NAMES = {
    'DOCUMENT'  : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument',
    'DOCPROPS'  : 'http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties',
    'APPPROPS'  : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties',
    'STYLES'    : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles',
    'NUMBERING' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering',
    'FONTS'     : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable',
    'EMBEDDED_FONT' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/font',
    'IMAGES'    : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/image',
    'LINKS'     : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink',
    'FOOTNOTES' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/footnotes',
    'ENDNOTES'  : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes',
    'THEMES'    : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme',
    'SETTINGS'  : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings',
    'WEB_SETTINGS' : 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings',
 }
 STRICT_NAMES = {
    k:v.replace('http://schemas.openxmlformats.org/officeDocument/2006',  'http://purl.oclc.org/ooxml/officeDocument')
    for k, v in iteritems(TRANSITIONAL_NAMES)
 }
 TRANSITIONAL_NAMESPACES = {
    'mo': 'http://schemas.microsoft.com/office/mac/office/2008/main',
    'o': 'urn:schemas-microsoft-com:office:office',
    've': 'http://schemas.openxmlformats.org/markup-compatibility/2006',
    'mc': 'http://schemas.openxmlformats.org/markup-compatibility/2006',
    # Text Content
    'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main',
    'w10': 'urn:schemas-microsoft-com:office:word',
    'wne': 'http://schemas.microsoft.com/office/word/2006/wordml',
    'xml': 'http://www.w3.org/XML/1998/namespace',
    # Drawing
    'a': 'http://schemas.openxmlformats.org/drawingml/2006/main',
    'm': 'http://schemas.openxmlformats.org/officeDocument/2006/math',
    'mv': 'urn:schemas-microsoft-com:mac:vml',
    'pic': 'http://schemas.openxmlformats.org/drawingml/2006/picture',
    'v': 'urn:schemas-microsoft-com:vml',
    'wp': 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing',
    # Properties (core and extended)
    'cp': 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'ep': 'http://schemas.openxmlformats.org/officeDocument/2006/extended-properties',
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
    # Content Types
    'ct': 'http://schemas.openxmlformats.org/package/2006/content-types',
    # Package Relationships
    'r': 'http://schemas.openxmlformats.org/officeDocument/2006/relationships',
    'pr': 'http://schemas.openxmlformats.org/package/2006/relationships',
    # Dublin Core document properties
    'dcmitype': 'http://purl.org/dc/dcmitype/',
    'dcterms': 'http://purl.org/dc/terms/'
 }
 STRICT_NAMESPACES = {
    k:v.replace(
        'http://schemas.openxmlformats.org/officeDocument/2006', 'http://purl.oclc.org/ooxml/officeDocument').replace(
        'http://schemas.openxmlformats.org/wordprocessingml/2006', 'http://purl.oclc.org/ooxml/wordprocessingml').replace(
        'http://schemas.openxmlformats.org/drawingml/2006', 'http://purl.oclc.org/ooxml/drawingml')
    for k, v in iteritems(TRANSITIONAL_NAMESPACES)
 }
 # }}}
 def barename(x):
    return x.rpartition('}')[-1]
 def XML(x):
    return '{%s}%s' % (TRANSITIONAL_NAMESPACES['xml'], x)
 def generate_anchor(name, existing):
    x = y = 'id_' + re.sub(r'[^0-9a-zA-Z_]', '', ascii_text(name)).lstrip('_')
    c = 1
    while y in existing:
        y = '%s_%d' % (x, c)
        c += 1
    return y
 class DOCXNamespace(object):
    def __init__(self, transitional=True):
        self.xpath_cache = {}
        if transitional:
            self.namespaces = TRANSITIONAL_NAMESPACES.copy()
            self.names = TRANSITIONAL_NAMES.copy()
        else:
            self.namespaces = STRICT_NAMESPACES.copy()
            self.names = STRICT_NAMES.copy()
    def XPath(self, expr):
        ans = self.xpath_cache.get(expr, None)
        if ans is None:
            self.xpath_cache[expr] = ans = X(expr, namespaces=self.namespaces)
        return ans
    def is_tag(self, x, q):
        tag = getattr(x, 'tag', x)
        ns, name = q.partition(':')[0::2]
        return '{%s}%s' % (self.namespaces.get(ns, None), name) == tag
    def expand(self, name, sep=':'):
        ns, tag = name.partition(sep)[::2]
        if ns and tag:
            tag = '{%s}%s' % (self.namespaces[ns], tag)
        return tag or ns
    def get(self, x, attr, default=None):
        return x.attrib.get(self.expand(attr), default)
    def ancestor(self, elem, name):
        try:
            return self.XPath('ancestor::%s[1]' % name)(elem)[0]
        except IndexError:
            return None
    def children(self, elem, *args):
        return self.XPath('|'.join('child::%s' % a for a in args))(elem)
    def descendants(self, elem, *args):
        return self.XPath('|'.join('descendant::%s' % a for a in args))(elem)
    def makeelement(self, root, tag, append=True, **attrs):
        ans = root.makeelement(self.expand(tag), **{self.expand(k, sep='_'):v for k, v in iteritems(attrs)})
        if append:
            root.append(ans)
        return ans
--- a/ebook_converter/ebooks/docx/numbering.py
+++ b/ebook_converter/ebooks/docx/numbering.py
@@ -0,0 +1,388 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import re, string
 from collections import Counter, defaultdict
 from functools import partial
 from lxml.html.builder import OL, UL, SPAN
 from calibre.ebooks.docx.block_styles import ParagraphStyle
 from calibre.ebooks.docx.char_styles import RunStyle, inherit
 from calibre.ebooks.metadata import roman
 from polyglot.builtins import iteritems, unicode_type
 STYLE_MAP = {
    'aiueo': 'hiragana',
    'aiueoFullWidth': 'hiragana',
    'hebrew1': 'hebrew',
    'iroha': 'katakana-iroha',
    'irohaFullWidth': 'katakana-iroha',
    'lowerLetter': 'lower-alpha',
    'lowerRoman': 'lower-roman',
    'none': 'none',
    'upperLetter': 'upper-alpha',
    'upperRoman': 'upper-roman',
    'chineseCounting': 'cjk-ideographic',
    'decimalZero': 'decimal-leading-zero',
 }
 def alphabet(val, lower=True):
    x = string.ascii_lowercase if lower else string.ascii_uppercase
    return x[(abs(val - 1)) % len(x)]
 alphabet_map = {
    'lower-alpha':alphabet, 'upper-alpha':partial(alphabet, lower=False),
    'lower-roman':lambda x:roman(x).lower(), 'upper-roman':roman,
    # Pad to two digits (01..09, 10, 11, ...) instead of always prefixing a zero
    'decimal-leading-zero': lambda x: '%02d' % x
 }
 class Level(object):
    def __init__(self, namespace, lvl=None):
        self.namespace = namespace
        self.restart = None
        self.start = 0
        self.fmt = 'decimal'
        self.para_link = None
        self.paragraph_style = self.character_style = None
        self.is_numbered = False
        self.num_template = None
        self.bullet_template = None
        self.pic_id = None
        if lvl is not None:
            self.read_from_xml(lvl)
    def copy(self):
        ans = Level(self.namespace)
        for x in ('restart', 'pic_id', 'start', 'fmt', 'para_link', 'paragraph_style', 'character_style', 'is_numbered', 'num_template', 'bullet_template'):
            setattr(ans, x, getattr(self, x))
        return ans
    def format_template(self, counter, ilvl, template):
        def sub(m):
            x = int(m.group(1)) - 1
            if x > ilvl or x not in counter:
                return ''
            val = counter[x] - (0 if x == ilvl else 1)
            formatter = alphabet_map.get(self.fmt, lambda x: '%d' % x)
            return formatter(val)
        return re.sub(r'%(\d+)', sub, template).rstrip() + '\xa0'
    def read_from_xml(self, lvl, override=False):
        XPath, get = self.namespace.XPath, self.namespace.get
        for lr in XPath('./w:lvlRestart[@w:val]')(lvl):
            try:
                self.restart = int(get(lr, 'w:val'))
            except (TypeError, ValueError):
                pass
        for lr in XPath('./w:start[@w:val]')(lvl):
            try:
                self.start = int(get(lr, 'w:val'))
            except (TypeError, ValueError):
                pass
        for rPr in XPath('./w:rPr')(lvl):
            ps = RunStyle(self.namespace, rPr)
            if self.character_style is None:
                self.character_style = ps
            else:
                self.character_style.update(ps)
        lt = None
        for lr in XPath('./w:lvlText[@w:val]')(lvl):
            lt = get(lr, 'w:val')
        for lr in XPath('./w:numFmt[@w:val]')(lvl):
            val = get(lr, 'w:val')
            if val == 'bullet':
                self.is_numbered = False
                cs = self.character_style
                if lt in {'\uf0a7', 'o'} or (
                    cs is not None and cs.font_family is not inherit and cs.font_family.lower() in {'wingdings', 'symbol'}):
                    self.fmt = {'\uf0a7':'square', 'o':'circle'}.get(lt, 'disc')
                else:
                    self.bullet_template = lt
                for lpid in XPath('./w:lvlPicBulletId[@w:val]')(lvl):
                    self.pic_id = get(lpid, 'w:val')
            else:
                self.is_numbered = True
                self.fmt = STYLE_MAP.get(val, 'decimal')
                if lt and re.match(r'%\d+\.$', lt) is None:
                    self.num_template = lt
        for lr in XPath('./w:pStyle[@w:val]')(lvl):
            self.para_link = get(lr, 'w:val')
        for pPr in XPath('./w:pPr')(lvl):
            ps = ParagraphStyle(self.namespace, pPr)
            if self.paragraph_style is None:
                self.paragraph_style = ps
            else:
                self.paragraph_style.update(ps)
    def css(self, images, pic_map, rid_map):
        ans = {'list-style-type': self.fmt}
        if self.pic_id:
            rid = pic_map.get(self.pic_id, None)
            if rid:
                try:
                    fname = images.generate_filename(rid, rid_map=rid_map, max_width=20, max_height=20)
                except Exception:
                    fname = None
                else:
                    ans['list-style-image'] = 'url("images/%s")' % fname
        return ans
    def char_css(self):
        try:
            css = self.character_style.css
        except AttributeError:
            css = {}
        css.pop('font-family', None)
        return css
 class NumberingDefinition(object):
    def __init__(self, namespace, parent=None, an_id=None):
        self.namespace = namespace
        XPath, get = self.namespace.XPath, self.namespace.get
        self.levels = {}
        self.abstract_numbering_definition_id = an_id
        if parent is not None:
            for lvl in XPath('./w:lvl')(parent):
                try:
                    ilvl = int(get(lvl, 'w:ilvl', 0))
                except (TypeError, ValueError):
                    ilvl = 0
                self.levels[ilvl] = Level(namespace, lvl)
    def copy(self):
        ans = NumberingDefinition(self.namespace, an_id=self.abstract_numbering_definition_id)
        for l, lvl in iteritems(self.levels):
            ans.levels[l] = lvl.copy()
        return ans
 class Numbering(object):
    def __init__(self, namespace):
        self.namespace = namespace
        self.definitions = {}
        self.instances = {}
        self.counters = defaultdict(Counter)
        self.starts = {}
        self.pic_map = {}
    def __call__(self, root, styles, rid_map):
        ' Read all numbering style definitions '
        XPath, get = self.namespace.XPath, self.namespace.get
        self.rid_map = rid_map
        for npb in XPath('./w:numPicBullet[@w:numPicBulletId]')(root):
            npbid = get(npb, 'w:numPicBulletId')
            for idata in XPath('descendant::v:imagedata[@r:id]')(npb):
                rid = get(idata, 'r:id')
                self.pic_map[npbid] = rid
        lazy_load = {}
        for an in XPath('./w:abstractNum[@w:abstractNumId]')(root):
            an_id = get(an, 'w:abstractNumId')
            nsl = XPath('./w:numStyleLink[@w:val]')(an)
            if nsl:
                lazy_load[an_id] = get(nsl[0], 'w:val')
            else:
                nd = NumberingDefinition(self.namespace, an, an_id=an_id)
                self.definitions[an_id] = nd
        def create_instance(n, definition):
            nd = definition.copy()
            start_overrides = {}
            for lo in XPath('./w:lvlOverride')(n):
                try:
                    ilvl = int(get(lo, 'w:ilvl'))
                except (ValueError, TypeError):
                    ilvl = None
                for so in XPath('./w:startOverride[@w:val]')(lo):
                    try:
                        start_override = int(get(so, 'w:val'))
                    except (TypeError, ValueError):
                        pass
                    else:
                        start_overrides[ilvl] = start_override
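                for lvl in XPath('./w:lvl')(lo)[:1]:
                    if ilvl is None:
                        # Fall back to the ilvl declared on the w:lvl element itself
                        try:
                            ilvl = int(get(lvl, 'w:ilvl'))
                        except (ValueError, TypeError):
                            continue
                    alvl = nd.levels.get(ilvl, None)
                    if alvl is None:
                        # Store the new level, otherwise the override is silently discarded
                        alvl = nd.levels[ilvl] = Level(self.namespace)
                    alvl.read_from_xml(lvl, override=True)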
            for ilvl, so in iteritems(start_overrides):
                try:
                    nd.levels[ilvl].start = so
                except KeyError:
                    pass
            return nd
        next_pass = {}
        for n in XPath('./w:num[@w:numId]')(root):
            an_id = None
            num_id = get(n, 'w:numId')
            for an in XPath('./w:abstractNumId[@w:val]')(n):
                an_id = get(an, 'w:val')
            d = self.definitions.get(an_id, None)
            if d is None:
                next_pass[num_id] = (an_id, n)
                continue
            self.instances[num_id] = create_instance(n, d)
        numbering_links = styles.numbering_style_links
        for an_id, style_link in iteritems(lazy_load):
            num_id = numbering_links[style_link]
            self.definitions[an_id] = self.instances[num_id].copy()
        for num_id, (an_id, n) in iteritems(next_pass):
            d = self.definitions.get(an_id, None)
            if d is not None:
                self.instances[num_id] = create_instance(n, d)
        for num_id, d in iteritems(self.instances):
            self.starts[num_id] = {lvl:d.levels[lvl].start for lvl in d.levels}
    def get_pstyle(self, num_id, style_id):
        d = self.instances.get(num_id, None)
        if d is not None:
            for ilvl, lvl in iteritems(d.levels):
                if lvl.para_link == style_id:
                    return ilvl
    def get_para_style(self, num_id, lvl):
        d = self.instances.get(num_id, None)
        if d is not None:
            lvl = d.levels.get(lvl, None)
            return getattr(lvl, 'paragraph_style', None)
    def update_counter(self, counter, levelnum, levels):
        counter[levelnum] += 1
        for ilvl, lvl in iteritems(levels):
            restart = lvl.restart
            if (restart is None and ilvl == levelnum + 1) or restart == levelnum + 1:
                counter[ilvl] = lvl.start
    def apply_markup(self, items, body, styles, object_map, images):
        seen_instances = set()
        for p, num_id, ilvl in items:
            d = self.instances.get(num_id, None)
            if d is not None:
                lvl = d.levels.get(ilvl, None)
                if lvl is not None:
                    an_id = d.abstract_numbering_definition_id
                    counter = self.counters[an_id]
                    if ilvl not in counter or num_id not in seen_instances:
                        counter[ilvl] = self.starts[num_id][ilvl]
                    seen_instances.add(num_id)
                    p.tag = 'li'
                    p.set('value', '%s' % counter[ilvl])
                    p.set('list-lvl', unicode_type(ilvl))
                    p.set('list-id', num_id)
                    if lvl.num_template is not None:
                        val = lvl.format_template(counter, ilvl, lvl.num_template)
                        p.set('list-template', val)
                    elif lvl.bullet_template is not None:
                        val = lvl.format_template(counter, ilvl, lvl.bullet_template)
                        p.set('list-template', val)
                    self.update_counter(counter, ilvl, d.levels)
        templates = {}
        def commit(current_run):
            if not current_run:
                return
            start = current_run[0]
            parent = start.getparent()
            idx = parent.index(start)
            d = self.instances[start.get('list-id')]
            ilvl = int(start.get('list-lvl'))
            lvl = d.levels[ilvl]
            lvlid = start.get('list-id') + start.get('list-lvl')
            has_template = 'list-template' in start.attrib
            wrap = (OL if lvl.is_numbered or has_template else UL)('\n\t')
            if has_template:
                wrap.set('lvlid', lvlid)
            else:
                wrap.set('class', styles.register(lvl.css(images, self.pic_map, self.rid_map), 'list'))
            ccss = lvl.char_css()
            if ccss:
                ccss = styles.register(ccss, 'bullet')
            parent.insert(idx, wrap)
            last_val = None
            for child in current_run:
                wrap.append(child)
                child.tail = '\n\t'
                if has_template:
                    span = SPAN()
                    span.text = child.text
                    child.text = None
                    for gc in child:
                        span.append(gc)
                    child.append(span)
                    span = SPAN(child.get('list-template'))
                    if ccss:
                        span.set('class', ccss)
                    last = templates.get(lvlid, '')
                    if span.text and len(span.text) > len(last):
                        templates[lvlid] = span.text
                    child.insert(0, span)
                for attr in ('list-lvl', 'list-id', 'list-template'):
                    child.attrib.pop(attr, None)
                val = int(child.get('value'))
                if last_val == val - 1 or wrap.tag == 'ul' or (last_val is None and val == 1):
                    child.attrib.pop('value')
                last_val = val
            current_run[-1].tail = '\n'
            del current_run[:]
        parents = set()
        for child in body.iterdescendants('li'):
            parents.add(child.getparent())
        for parent in parents:
            current_run = []
            for child in parent:
                if child.tag == 'li':
                    if current_run:
                        last = current_run[-1]
                        if (last.get('list-id') , last.get('list-lvl')) != (child.get('list-id'), child.get('list-lvl')):
                            commit(current_run)
                    current_run.append(child)
                else:
                    commit(current_run)
            commit(current_run)
        # Convert the list items that use custom text for bullets into tables
        # so that they display correctly
        for wrap in body.xpath('//ol[@lvlid]'):
            wrap.attrib.pop('lvlid')
            wrap.tag = 'div'
            wrap.set('style', 'display:table')
            for i, li in enumerate(wrap.iterchildren('li')):
                li.tag = 'div'
                li.attrib.pop('value', None)
                li.set('style', 'display:table-row')
                obj = object_map[li]
                bs = styles.para_cache[obj]
                if i == 0:
                    wrap.set('style', 'display:table; padding-left:%s' %
                             bs.css.get('margin-left', '0'))
                bs.css.pop('margin-left', None)
                for child in li:
                    child.set('style', 'display:table-cell')
--- a/ebook_converter/ebooks/docx/settings.py
+++ b/ebook_converter/ebooks/docx/settings.py
@@ -0,0 +1,21 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 class Settings(object):
    def __init__(self, namespace):
        self.default_tab_stop = 720 / 20
        self.namespace = namespace
    def __call__(self, root):
        for dts in self.namespace.XPath('//w:defaultTabStop[@w:val]')(root):
            try:
                self.default_tab_stop = int(self.namespace.get(dts, 'w:val')) / 20
            except (ValueError, TypeError, AttributeError):
                pass
--- a/ebook_converter/ebooks/docx/styles.py
+++ b/ebook_converter/ebooks/docx/styles.py
@@ -0,0 +1,504 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import textwrap
 from collections import OrderedDict, Counter
 from calibre.ebooks.docx.block_styles import ParagraphStyle, inherit, twips
 from calibre.ebooks.docx.char_styles import RunStyle
 from calibre.ebooks.docx.tables import TableStyle
 from polyglot.builtins import iteritems, itervalues
 class PageProperties(object):
    '''
    Class representing page level properties (page size/margins) read from
    sectPr elements.
    '''
    def __init__(self, namespace, elems=()):
        self.width, self.height = 595.28, 841.89  # pts, A4
        self.margin_left = self.margin_right = 72  # pts
        def setval(attr, val):
            val = twips(val)
            if val is not None:
                setattr(self, attr, val)
        for sectPr in elems:
            for pgSz in namespace.XPath('./w:pgSz')(sectPr):
                w, h = namespace.get(pgSz, 'w:w'), namespace.get(pgSz, 'w:h')
                setval('width', w), setval('height', h)
            for pgMar in namespace.XPath('./w:pgMar')(sectPr):
                l, r = namespace.get(pgMar, 'w:left'), namespace.get(pgMar, 'w:right')
                setval('margin_left', l), setval('margin_right', r)
 class Style(object):
    '''
    Class representing a <w:style> element. Can contain block, character, etc. styles.
    '''
    def __init__(self, namespace, elem):
        self.namespace = namespace
        self.name_path = namespace.XPath('./w:name[@w:val]')
        self.based_on_path = namespace.XPath('./w:basedOn[@w:val]')
        self.resolved = False
        self.style_id = namespace.get(elem, 'w:styleId')
        self.style_type = namespace.get(elem, 'w:type')
        names = self.name_path(elem)
        self.name = namespace.get(names[-1], 'w:val') if names else None
        based_on = self.based_on_path(elem)
        self.based_on = namespace.get(based_on[0], 'w:val') if based_on else None
        if self.style_type == 'numbering':
            self.based_on = None
        self.is_default = namespace.get(elem, 'w:default') in {'1', 'on', 'true'}
        self.paragraph_style = self.character_style = self.table_style = None
        if self.style_type in {'paragraph', 'character', 'table'}:
            if self.style_type == 'table':
                for tblPr in namespace.XPath('./w:tblPr')(elem):
                    ts = TableStyle(namespace, tblPr)
                    if self.table_style is None:
                        self.table_style = ts
                    else:
                        self.table_style.update(ts)
            if self.style_type in {'paragraph', 'table'}:
                for pPr in namespace.XPath('./w:pPr')(elem):
                    ps = ParagraphStyle(namespace, pPr)
                    if self.paragraph_style is None:
                        self.paragraph_style = ps
                    else:
                        self.paragraph_style.update(ps)
            for rPr in namespace.XPath('./w:rPr')(elem):
                rs = RunStyle(namespace, rPr)
                if self.character_style is None:
                    self.character_style = rs
                else:
                    self.character_style.update(rs)
        if self.style_type in {'numbering', 'paragraph'}:
            self.numbering_style_link = None
            for x in namespace.XPath('./w:pPr/w:numPr/w:numId[@w:val]')(elem):
                self.numbering_style_link = namespace.get(x, 'w:val')
    def resolve_based_on(self, parent):
        if parent.table_style is not None:
            if self.table_style is None:
                self.table_style = TableStyle(self.namespace)
            self.table_style.resolve_based_on(parent.table_style)
        if parent.paragraph_style is not None:
            if self.paragraph_style is None:
                self.paragraph_style = ParagraphStyle(self.namespace)
            self.paragraph_style.resolve_based_on(parent.paragraph_style)
        if parent.character_style is not None:
            if self.character_style is None:
                self.character_style = RunStyle(self.namespace)
            self.character_style.resolve_based_on(parent.character_style)
 class Styles(object):
    '''
    Collection of all styles defined in the document. Used to get the final styles applicable to elements in the document markup.
    '''
    def __init__(self, namespace, tables):
        self.namespace = namespace
        self.id_map = OrderedDict()
        self.para_cache = {}
        self.para_char_cache = {}
        self.run_cache = {}
        self.classes = {}
        self.counter = Counter()
        self.default_styles = {}
        self.tables = tables
        self.numbering_style_links = {}
        self.default_paragraph_style = self.default_character_style = None
    def __iter__(self):
        for s in itervalues(self.id_map):
            yield s
    def __getitem__(self, key):
        return self.id_map[key]
    def __len__(self):
        return len(self.id_map)
    def get(self, key, default=None):
        return self.id_map.get(key, default)
    def __call__(self, root, fonts, theme):
        self.fonts, self.theme = fonts, theme
        self.default_paragraph_style = self.default_character_style = None
        if root is not None:
            for s in self.namespace.XPath('//w:style')(root):
                s = Style(self.namespace, s)
                if s.style_id:
                    self.id_map[s.style_id] = s
                if s.is_default:
                    self.default_styles[s.style_type] = s
                if getattr(s, 'numbering_style_link', None) is not None:
                    self.numbering_style_links[s.style_id] = s.numbering_style_link
            for dd in self.namespace.XPath('./w:docDefaults')(root):
                for pd in self.namespace.XPath('./w:pPrDefault')(dd):
                    for pPr in self.namespace.XPath('./w:pPr')(pd):
                        ps = ParagraphStyle(self.namespace, pPr)
                        if self.default_paragraph_style is None:
                            self.default_paragraph_style = ps
                        else:
                            self.default_paragraph_style.update(ps)
                for rd in self.namespace.XPath('./w:rPrDefault')(dd):
                    for rPr in self.namespace.XPath('./w:rPr')(rd):
                        rs = RunStyle(self.namespace, rPr)
                        if self.default_character_style is None:
                            self.default_character_style = rs
                        else:
                            self.default_character_style.update(rs)
        def resolve(s, p):
            if p is not None:
                if not p.resolved:
                    resolve(p, self.get(p.based_on))
                s.resolve_based_on(p)
            s.resolved = True
        for s in self:
            if not s.resolved:
                resolve(s, self.get(s.based_on))
    def para_val(self, parent_styles, direct_formatting, attr):
        val = getattr(direct_formatting, attr)
        if val is inherit:
            for ps in reversed(parent_styles):
                pval = getattr(ps, attr)
                if pval is not inherit:
                    val = pval
                    break
        return val
    def run_val(self, parent_styles, direct_formatting, attr):
        val = getattr(direct_formatting, attr)
        if val is not inherit:
            return val
        if attr in direct_formatting.toggle_properties:
            # The spec (section 17.7.3) does not make sense, so we follow the behavior
            # of Word, which seems to only consider the document default if the
            # property has not been defined in any styles.
            vals = [int(getattr(rs, attr)) for rs in parent_styles if rs is not self.default_character_style and getattr(rs, attr) is not inherit]
            if vals:
                return sum(vals) % 2 == 1
            if self.default_character_style is not None:
                return getattr(self.default_character_style, attr) is True
            return False
        for rs in reversed(parent_styles):
            rval = getattr(rs, attr)
            if rval is not inherit:
                return rval
        return val
    def resolve_paragraph(self, p):
        ans = self.para_cache.get(p, None)
        if ans is None:
            linked_style = None
            ans = self.para_cache[p] = ParagraphStyle(self.namespace)
            ans.style_name = None
            direct_formatting = None
            is_section_break = False
            for pPr in self.namespace.XPath('./w:pPr')(p):
                ps = ParagraphStyle(self.namespace, pPr)
                if direct_formatting is None:
                    direct_formatting = ps
                else:
                    direct_formatting.update(ps)
                if self.namespace.XPath('./w:sectPr')(pPr):
                    is_section_break = True
            if direct_formatting is None:
                direct_formatting = ParagraphStyle(self.namespace)
            parent_styles = []
            if self.default_paragraph_style is not None:
                parent_styles.append(self.default_paragraph_style)
            ts = self.tables.para_style(p)
            if ts is not None:
                parent_styles.append(ts)
            default_para = self.default_styles.get('paragraph', None)
            if direct_formatting.linked_style is not None:
                ls = linked_style = self.get(direct_formatting.linked_style)
                if ls is not None:
                    ans.style_name = ls.name
                    ps = ls.paragraph_style
                    if ps is not None:
                        parent_styles.append(ps)
                    if ls.character_style is not None:
                        self.para_char_cache[p] = ls.character_style
            elif default_para is not None:
                if default_para.paragraph_style is not None:
                    parent_styles.append(default_para.paragraph_style)
                if default_para.character_style is not None:
                    self.para_char_cache[p] = default_para.character_style
            def has_numbering(block_style):
                num_id, lvl = getattr(block_style, 'numbering_id', inherit), getattr(block_style, 'numbering_level', inherit)
                return num_id is not None and num_id is not inherit and lvl is not None and lvl is not inherit
            is_numbering = has_numbering(direct_formatting)
            is_section_break = is_section_break and not self.namespace.XPath('./w:r')(p)
            if is_numbering and not is_section_break:
                num_id, lvl = direct_formatting.numbering_id, direct_formatting.numbering_level
                p.set('calibre_num_id', '%s:%s' % (lvl, num_id))
                ps = self.numbering.get_para_style(num_id, lvl)
                if ps is not None:
                    parent_styles.append(ps)
            if (
                not is_numbering and not is_section_break and linked_style is not None and has_numbering(linked_style.paragraph_style)
            ):
                num_id, lvl = linked_style.paragraph_style.numbering_id, linked_style.paragraph_style.numbering_level
                p.set('calibre_num_id', '%s:%s' % (lvl, num_id))
                is_numbering = True
                ps = self.numbering.get_para_style(num_id, lvl)
                if ps is not None:
                    parent_styles.append(ps)
            for attr in ans.all_properties:
                if not (is_numbering and attr == 'text_indent'):  # skip text-indent for lists
                    setattr(ans, attr, self.para_val(parent_styles, direct_formatting, attr))
            ans.linked_style = direct_formatting.linked_style
        return ans
    def resolve_run(self, r):
        ans = self.run_cache.get(r, None)
        if ans is None:
            p = self.namespace.XPath('ancestor::w:p[1]')(r)
            p = p[0] if p else None
            ans = self.run_cache[r] = RunStyle(self.namespace)
            direct_formatting = None
            for rPr in self.namespace.XPath('./w:rPr')(r):
                rs = RunStyle(self.namespace, rPr)
                if direct_formatting is None:
                    direct_formatting = rs
                else:
                    direct_formatting.update(rs)
            if direct_formatting is None:
                direct_formatting = RunStyle(self.namespace)
            parent_styles = []
            default_char = self.default_styles.get('character', None)
            if self.default_character_style is not None:
                parent_styles.append(self.default_character_style)
            pstyle = self.para_char_cache.get(p, None)
            if pstyle is not None:
                parent_styles.append(pstyle)
            # As best as I can understand the spec, table overrides should be
            # applied before paragraph overrides, but Word does it
            # this way; see the December 2007 table header in the demo
            # document.
            ts = self.tables.run_style(p)
            if ts is not None:
                parent_styles.append(ts)
            if direct_formatting.linked_style is not None:
                ls = getattr(self.get(direct_formatting.linked_style), 'character_style', None)
                if ls is not None:
                    parent_styles.append(ls)
            elif default_char is not None and default_char.character_style is not None:
                parent_styles.append(default_char.character_style)
            for attr in ans.all_properties:
                setattr(ans, attr, self.run_val(parent_styles, direct_formatting, attr))
            if ans.font_family is not inherit:
                ff = self.theme.resolve_font_family(ans.font_family)
                ans.font_family = self.fonts.family_for(ff, ans.b, ans.i)
        return ans
    def resolve(self, obj):
        if obj.tag.endswith('}p'):
            return self.resolve_paragraph(obj)
        if obj.tag.endswith('}r'):
            return self.resolve_run(obj)
    def cascade(self, layers):
        self.body_font_family = 'serif'
        self.body_font_size = '10pt'
        self.body_color = 'black'
        def promote_property(char_styles, block_style, prop):
            vals = {getattr(s, prop) for s in char_styles}
            if len(vals) == 1:
                # All the character styles have the same value
                for s in char_styles:
                    setattr(s, prop, inherit)
                setattr(block_style, prop, next(iter(vals)))
        for p, runs in iteritems(layers):
            has_links = '1' in {r.get('is-link', None) for r in runs}
            char_styles = [self.resolve_run(r) for r in runs]
            block_style = self.resolve_paragraph(p)
            for prop in ('font_family', 'font_size', 'cs_font_family', 'cs_font_size', 'color'):
                if has_links and prop == 'color':
                    # We cannot promote color, as browser rendering engines will
                    # override the link color, setting it to blue, unless the
                    # color is specified on the link element itself.
                    continue
                promote_property(char_styles, block_style, prop)
            for s in char_styles:
                if s.text_decoration == 'none':
                    # The default text decoration is 'none'
                    s.text_decoration = inherit
        def promote_most_common(block_styles, prop, default):
            c = Counter()
            for s in block_styles:
                val = getattr(s, prop)
                if val is not inherit:
                    c[val] += 1
            val = None
            if c:
                val = c.most_common(1)[0][0]
                for s in block_styles:
                    oval = getattr(s, prop)
                    if oval is inherit:
                        if default != val:
                            setattr(s, prop, default)
                    elif oval == val:
                        setattr(s, prop, inherit)
            return val
        block_styles = tuple(self.resolve_paragraph(p) for p in layers)
        ff = promote_most_common(block_styles, 'font_family', self.body_font_family)
        if ff is not None:
            self.body_font_family = ff
        fs = promote_most_common(block_styles, 'font_size', int(self.body_font_size[:2]))
        if fs is not None:
            self.body_font_size = '%.3gpt' % fs
        color = promote_most_common(block_styles, 'color', self.body_color)
        if color is not None:
            self.body_color = color
    def resolve_numbering(self, numbering):
        # When a numPr element appears inside a paragraph style, the lvl info
        # must be discarded and pStyle used instead.
        self.numbering = numbering
        for style in self:
            ps = style.paragraph_style
            if ps is not None and ps.numbering_id is not inherit:
                lvl = numbering.get_pstyle(ps.numbering_id, style.style_id)
                if lvl is None:
                    ps.numbering_id = ps.numbering_level = inherit
                else:
                    ps.numbering_level = lvl
    def apply_contextual_spacing(self, paras):
        last_para = None
        for p in paras:
            if last_para is not None:
                ls = self.resolve_paragraph(last_para)
                ps = self.resolve_paragraph(p)
                if ls.linked_style is not None and ls.linked_style == ps.linked_style:
                    if ls.contextualSpacing is True:
                        ls.margin_bottom = 0
                    if ps.contextualSpacing is True:
                        ps.margin_top = 0
            last_para = p
    def apply_section_page_breaks(self, paras):
        for p in paras:
            ps = self.resolve_paragraph(p)
            ps.pageBreakBefore = True
    def register(self, css, prefix):
        # Identical CSS dicts hash to the same key, so they share one class name
        h = hash(frozenset(iteritems(css)))
        ans, _ = self.classes.get(h, (None, None))
        if ans is None:
            self.counter[prefix] += 1
            ans = '%s_%d' % (prefix, self.counter[prefix])
            self.classes[h] = (ans, css)
        return ans
    def generate_classes(self):
        for bs in itervalues(self.para_cache):
            css = bs.css
            if css:
                self.register(css, 'block')
        for bs in itervalues(self.run_cache):
            css = bs.css
            if css:
                self.register(css, 'text')
    def class_name(self, css):
        h = hash(frozenset(iteritems(css)))
        return self.classes.get(h, (None, None))[0]
    def generate_css(self, dest_dir, docx, notes_nopb, nosupsub):
        ef = self.fonts.embed_fonts(dest_dir, docx)
        s = '''\
            body { font-family: %s; font-size: %s; color: %s }
            /* In word all paragraphs have zero margins unless explicitly specified in a style */
            p, h1, h2, h3, h4, h5, h6, div { margin: 0; padding: 0 }
            /* In word headings only have bold font if explicitly specified,
                similarly the font size is the body font size, unless explicitly set. */
            h1, h2, h3, h4, h5, h6 { font-weight: normal; font-size: 1rem }
            /* Setting padding-left to zero breaks rendering of lists, so we only set the other values to zero and leave padding-left for the user-agent */
            ul, ol { margin: 0; padding-top: 0; padding-bottom: 0; padding-right: 0 }
            /* The word hyperlink styling will set text-decoration to underline if needed */
            a { text-decoration: none }
            sup.noteref a { text-decoration: none }
            h1.notes-header { page-break-before: always }
            dl.footnote dt { font-size: large }
            dl.footnote dt a { text-decoration: none }
            '''
        if not notes_nopb:
            s += '''\
            dl.footnote { page-break-after: always }
            dl.footnote:last-of-type { page-break-after: avoid }
            '''
        s = s + '''\
            span.tab { white-space: pre }
            p.index-entry { text-indent: 0pt; }
            p.index-entry a:visited { color: blue }
            p.index-entry a:hover { color: red }
            '''
        if nosupsub:
            s = s + '''\
               sup { vertical-align: top }
               sub { vertical-align: bottom }
               '''
        prefix = textwrap.dedent(s) % (self.body_font_family, self.body_font_size, self.body_color)
        if ef:
            prefix = ef + '\n' + prefix
        ans = []
        for (cls, css) in sorted(itervalues(self.classes), key=lambda x:x[0]):
            b = ('\t%s: %s;' % (k, v) for k, v in iteritems(css))
            b = '\n'.join(b)
            ans.append('.%s {\n%s\n}\n' % (cls, b.rstrip(';')))
        return prefix + '\n' + '\n'.join(ans)
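A minimal sketch of the class de-duplication that `register` and `class_name` perform above, assuming plain dicts of CSS properties. `ClassRegistry` is a hypothetical stand-in and not part of this commit; it keys on the `frozenset` itself rather than on its hash, which is a deliberate swap to sidestep hash collisions.

```python
from collections import Counter


class ClassRegistry:
    """Hypothetical stand-in for the register()/class_name() pair."""

    def __init__(self):
        self.counter = Counter()
        self.classes = {}  # frozenset of CSS items -> (class name, css dict)

    def register(self, css, prefix):
        # Property order does not matter: equal dicts give equal frozensets
        key = frozenset(css.items())
        name, _ = self.classes.get(key, (None, None))
        if name is None:
            self.counter[prefix] += 1
            name = '%s_%d' % (prefix, self.counter[prefix])
            self.classes[key] = (name, css)
        return name


registry = ClassRegistry()
first = registry.register({'color': 'red', 'margin': '0'}, 'block')
again = registry.register({'margin': '0', 'color': 'red'}, 'block')
other = registry.register({'color': 'blue'}, 'block')
```

Here `first` and `again` resolve to the same class name, while `other` gets the next counter value for the `block` prefix.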
--- a/ebook_converter/ebooks/docx/tables.py
+++ b/ebook_converter/ebooks/docx/tables.py
@@ -0,0 +1,700 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 from lxml.html.builder import TABLE, TR, TD
 from calibre.ebooks.docx.block_styles import inherit, read_shd as rs, read_border, binary_property, border_props, ParagraphStyle, border_to_css
 from calibre.ebooks.docx.char_styles import RunStyle
 from polyglot.builtins import filter, iteritems, itervalues, range, unicode_type
 # Read from XML {{{
 read_shd = rs
 edges = ('left', 'top', 'right', 'bottom')
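 def _read_width(elem, get):
    # Convert a width element to a CSS length: dxa values are twentieths of
    # a point and pct values are fiftieths of a percent
    ans = inherit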
    try:
        w = int(get(elem, 'w:w'))
    except (TypeError, ValueError):
        w = 0
    typ = get(elem, 'w:type', 'auto')
    if typ == 'nil':
        ans = '0'
    elif typ == 'auto':
        ans = 'auto'
    elif typ == 'dxa':
        ans = '%.3gpt' % (w/20)
    elif typ == 'pct':
        ans = '%.3g%%' % (w/50)
    return ans
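The unit handling in `_read_width` can be sketched on its own as follows. `width_to_css` is a hypothetical helper written for illustration; it assumes Python 3 true division (the module gets the same behavior from `from __future__ import division`) and returns `None` where the original returns its `inherit` sentinel.

```python
def width_to_css(w, typ='auto'):
    """Illustrative copy of the unit handling; not the module's API."""
    if typ == 'nil':
        return '0'
    if typ == 'auto':
        return 'auto'
    if typ == 'dxa':
        # dxa widths are expressed in twentieths of a point
        return '%.3gpt' % (w / 20)
    if typ == 'pct':
        # pct widths are expressed in fiftieths of a percent
        return '%.3g%%' % (w / 50)
    # Unknown type: the original keeps its `inherit` sentinel here
    return None


examples = [
    width_to_css(240, 'dxa'),
    width_to_css(115, 'dxa'),
    width_to_css(2500, 'pct'),
    width_to_css(0, 'nil'),
]
```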
 def read_width(parent, dest, XPath, get):
    ans = inherit
    for tblW in XPath('./w:tblW')(parent):
        ans = _read_width(tblW, get)
    setattr(dest, 'width', ans)
 def read_cell_width(parent, dest, XPath, get):
    ans = inherit
    for tblW in XPath('./w:tcW')(parent):
        ans = _read_width(tblW, get)
    setattr(dest, 'width', ans)
 def read_padding(parent, dest, XPath, get):
    name = 'tblCellMar' if parent.tag.endswith('}tblPr') else 'tcMar'
    ans = {x:inherit for x in edges}
    for mar in XPath('./w:%s' % name)(parent):
        for x in edges:
            for edge in XPath('./w:%s' % x)(mar):
                ans[x] = _read_width(edge, get)
    for x in edges:
        setattr(dest, 'cell_padding_%s' % x, ans[x])
 def read_justification(parent, dest, XPath, get):
    left = right = inherit
    for jc in XPath('./w:jc[@w:val]')(parent):
        val = get(jc, 'w:val')
        if not val:
            continue
        if val == 'left':
            right = 'auto'
        elif val == 'right':
            left = 'auto'
        elif val == 'center':
            left = right = 'auto'
    setattr(dest, 'margin_left', left)
    setattr(dest, 'margin_right', right)
 def read_spacing(parent, dest, XPath, get):
    ans = inherit
    for cs in XPath('./w:tblCellSpacing')(parent):
        ans = _read_width(cs, get)
    setattr(dest, 'spacing', ans)
 def read_float(parent, dest, XPath, get):
    ans = inherit
    for x in XPath('./w:tblpPr')(parent):
        ans = {k.rpartition('}')[-1]: v for k, v in iteritems(x.attrib)}
    setattr(dest, 'float', ans)
 def read_indent(parent, dest, XPath, get):
    ans = inherit
    for cs in XPath('./w:tblInd')(parent):
        ans = _read_width(cs, get)
    setattr(dest, 'indent', ans)
 border_edges = ('left', 'top', 'right', 'bottom', 'insideH', 'insideV')
 def read_borders(parent, dest, XPath, get):
    name = 'tblBorders' if parent.tag.endswith('}tblPr') else 'tcBorders'
    read_border(parent, dest, XPath, get, border_edges, name)
 def read_height(parent, dest, XPath, get):
    ans = inherit
    for rh in XPath('./w:trHeight')(parent):
        rule = get(rh, 'w:hRule', 'auto')
        if rule in {'auto', 'atLeast', 'exact'}:
            val = get(rh, 'w:val')
            ans = (rule, val)
    setattr(dest, 'height', ans)
 def read_vertical_align(parent, dest, XPath, get):
    ans = inherit
    for va in XPath('./w:vAlign')(parent):
        val = get(va, 'w:val')
        ans = {'center': 'middle', 'top': 'top', 'bottom': 'bottom'}.get(val, 'middle')
    setattr(dest, 'vertical_align', ans)
 def read_col_span(parent, dest, XPath, get):
    ans = inherit
    for gs in XPath('./w:gridSpan')(parent):
        try:
            ans = int(get(gs, 'w:val'))
        except (TypeError, ValueError):
            continue
    setattr(dest, 'col_span', ans)
 def read_merge(parent, dest, XPath, get):
    for x in ('hMerge', 'vMerge'):
        ans = inherit
        for m in XPath('./w:%s' % x)(parent):
            ans = get(m, 'w:val', 'continue')
        setattr(dest, x, ans)
 def read_band_size(parent, dest, XPath, get):
    for x in ('Col', 'Row'):
        ans = 1
        for y in XPath('./w:tblStyle%sBandSize' % x)(parent):
            try:
                ans = int(get(y, 'w:val'))
            except (TypeError, ValueError):
                continue
        setattr(dest, '%s_band_size' % x.lower(), ans)
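A small sketch of how a band size turns a row or column index into a banding override name. It mirrors the arithmetic used by `Table.get_overrides` later in this file; `band_override` is a hypothetical name, and the sketch assumes a band size of at least 1.

```python
def band_override(index, band_size, axis):
    """Name the banding override for a zero-based row or column index."""
    # Consecutive groups of band_size rows (or columns) alternate bands
    odd_band = (index // band_size) % 2 == 1
    return 'band%d%s' % (1 if odd_band else 2, axis)


# Rows 0..5 with a band size of 2
names = [band_override(r, 2, 'Horz') for r in range(6)]
```

Note that, as in the original code, the first group of rows maps to `band2Horz` and the second to `band1Horz`.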
 def read_look(parent, dest, XPath, get):
    ans = 0
    for x in XPath('./w:tblLook')(parent):
        try:
            ans = int(get(x, 'w:val'), 16)
        except (ValueError, TypeError):
            continue
    setattr(dest, 'look', ans)
 # }}}
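The value stored by `read_look` is a bitmask. The sketch below decodes it using the same bit values that `Table.override_allowed` tests later in this file; `LOOK_FLAGS` and `enabled_overrides` are hypothetical names, and `bandHorz`/`bandVert` are shorthand labels for the two banding families rather than real override names.

```python
# Bits that switch a conditional format on
LOOK_FLAGS = {
    0x0020: 'firstRow',
    0x0040: 'lastRow',
    0x0080: 'firstCol',
    0x0100: 'lastCol',
}


def enabled_overrides(look_hex):
    """Return the override families enabled by a hexadecimal tblLook value."""
    look = int(look_hex, 16)
    names = {name for bit, name in LOOK_FLAGS.items() if look & bit}
    # The banding bits are negative flags: a set bit disables banding
    if not look & 0x0200:
        names.add('bandHorz')
    if not look & 0x0400:
        names.add('bandVert')
    return names


common = enabled_overrides('04A0')
```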
 def clone(style):
    if style is None:
        return None
    try:
        ans = type(style)(style.namespace)
    except TypeError:
        return None
    ans.update(style)
    return ans
 class Style(object):
    is_bidi = False
    def update(self, other):
        for prop in self.all_properties:
            nval = getattr(other, prop)
            if nval is not inherit:
                setattr(self, prop, nval)
    def apply_bidi(self):
        self.is_bidi = True
    def convert_spacing(self):
        ans = {}
        if self.spacing is not inherit:
            if self.spacing in {'auto', '0'}:
                ans['border-collapse'] = 'collapse'
            else:
                ans['border-collapse'] = 'separate'
                ans['border-spacing'] = self.spacing
        return ans
    def convert_border(self):
        c = {}
        for x in edges:
            border_to_css(x, self, c)
            val = getattr(self, 'padding_%s' % x)
            if val is not inherit:
                c['padding-%s' % x] = '%.3gpt' % val
        if self.is_bidi:
            for a in ('padding-%s', 'border-%s-style', 'border-%s-color', 'border-%s-width'):
                l, r = c.get(a % 'left'), c.get(a % 'right')
                if l is not None:
                    c[a % 'right'] = l
                if r is not None:
                    c[a % 'left'] = r
        return c
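The layering rule in `Style.update` can be seen in isolation with a stand-in sentinel. `DemoStyle` and its two properties are hypothetical, chosen only to show that a value left at `inherit` never overwrites an earlier layer.

```python
inherit = object()  # stand-in for the sentinel imported from block_styles


class DemoStyle:
    all_properties = ('width', 'spacing')

    def __init__(self, **values):
        # Anything not given explicitly starts out as inherit
        for prop in self.all_properties:
            setattr(self, prop, values.get(prop, inherit))

    def update(self, other):
        # Same rule as Style.update: only explicitly set values override
        for prop in self.all_properties:
            nval = getattr(other, prop)
            if nval is not inherit:
                setattr(self, prop, nval)


base = DemoStyle(width='100%', spacing='0')
base.update(DemoStyle(spacing='2pt'))
```

After the update, `base.width` keeps the earlier layer's value and only `base.spacing` changes.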
 class RowStyle(Style):
    all_properties = ('height', 'cantSplit', 'hidden', 'spacing',)
    def __init__(self, namespace, trPr=None):
        self.namespace = namespace
        if trPr is None:
            for p in self.all_properties:
                setattr(self, p, inherit)
        else:
            for p in ('hidden', 'cantSplit'):
                setattr(self, p, binary_property(trPr, p, namespace.XPath, namespace.get))
            for p in ('spacing', 'height'):
                f = globals()['read_%s' % p]
                f(trPr, self, namespace.XPath, namespace.get)
        self._css = None
    @property
    def css(self):
        if self._css is None:
            c = self._css = {}
            if self.hidden is True:
                c['display'] = 'none'
            if self.cantSplit is True:
                c['page-break-inside'] = 'avoid'
            if self.height is not inherit:
                rule, val = self.height
                if rule != 'auto':
                    try:
                        c['min-height' if rule == 'atLeast' else 'height'] = '%.3gpt' % (int(val)/20)
                    except (ValueError, TypeError):
                        pass
            c.update(self.convert_spacing())
        return self._css
 class CellStyle(Style):
    all_properties = ('background_color', 'cell_padding_left', 'cell_padding_right', 'cell_padding_top',
        'cell_padding_bottom', 'width', 'vertical_align', 'col_span', 'vMerge', 'hMerge', 'row_span',
    ) + tuple(k % edge for edge in border_edges for k in border_props)
    def __init__(self, namespace, tcPr=None):
        self.namespace = namespace
        if tcPr is None:
            for p in self.all_properties:
                setattr(self, p, inherit)
        else:
            for x in ('borders', 'shd', 'padding', 'cell_width', 'vertical_align', 'col_span', 'merge'):
                f = globals()['read_%s' % x]
                f(tcPr, self, namespace.XPath, namespace.get)
            self.row_span = inherit
        self._css = None
    @property
    def css(self):
        if self._css is None:
            self._css = c = {}
            if self.background_color is not inherit:
                c['background-color'] = self.background_color
            if self.width not in (inherit, 'auto'):
                c['width'] = self.width
            c['vertical-align'] = 'top' if self.vertical_align is inherit else self.vertical_align
            for x in edges:
                val = getattr(self, 'cell_padding_%s' % x)
                if val not in (inherit, 'auto'):
                    c['padding-%s' % x] =  val
                elif val is inherit and x in {'left', 'right'}:
                    c['padding-%s' % x] = '%.3gpt' % (115/20)
            # In Word, tables are apparently rendered with some default top and
            # bottom padding irrespective of the cellMargin values. Simulate
            # that here.
            for x in ('top', 'bottom'):
                if c.get('padding-%s' % x, '0pt') == '0pt':
                    c['padding-%s' % x] = '0.5ex'
            c.update(self.convert_border())
        return self._css
 class TableStyle(Style):
    all_properties = (
        'width', 'float', 'cell_padding_left', 'cell_padding_right', 'cell_padding_top',
        'cell_padding_bottom', 'margin_left', 'margin_right', 'background_color',
        'spacing', 'indent', 'overrides', 'col_band_size', 'row_band_size', 'look', 'bidi',
    ) + tuple(k % edge for edge in border_edges for k in border_props)
    def __init__(self, namespace, tblPr=None):
        self.namespace = namespace
        if tblPr is None:
            for p in self.all_properties:
                setattr(self, p, inherit)
        else:
            self.overrides = inherit
            self.bidi = binary_property(tblPr, 'bidiVisual', namespace.XPath, namespace.get)
            for x in ('width', 'float', 'padding', 'shd', 'justification', 'spacing', 'indent', 'borders', 'band_size', 'look'):
                f = globals()['read_%s' % x]
                f(tblPr, self, self.namespace.XPath, self.namespace.get)
            parent = tblPr.getparent()
            if self.namespace.is_tag(parent, 'w:style'):
                self.overrides = {}
                for tblStylePr in self.namespace.XPath('./w:tblStylePr[@w:type]')(parent):
                    otype = self.namespace.get(tblStylePr, 'w:type')
                    orides = self.overrides[otype] = {}
                    for tblPr in self.namespace.XPath('./w:tblPr')(tblStylePr):
                        orides['table'] = TableStyle(self.namespace, tblPr)
                    for trPr in self.namespace.XPath('./w:trPr')(tblStylePr):
                        orides['row'] = RowStyle(self.namespace, trPr)
                    for tcPr in self.namespace.XPath('./w:tcPr')(tblStylePr):
                        orides['cell'] = CellStyle(self.namespace, tcPr)
                    for pPr in self.namespace.XPath('./w:pPr')(tblStylePr):
                        orides['para'] = ParagraphStyle(self.namespace, pPr)
                    for rPr in self.namespace.XPath('./w:rPr')(tblStylePr):
                        orides['run'] = RunStyle(self.namespace, rPr)
        self._css = None
    def resolve_based_on(self, parent):
        for p in self.all_properties:
            val = getattr(self, p)
            if val is inherit:
                setattr(self, p, getattr(parent, p))
    @property
    def css(self):
        if self._css is None:
            c = self._css = {}
            if self.width not in (inherit, 'auto'):
                c['width'] = self.width
            for x in ('background_color', 'margin_left', 'margin_right'):
                val = getattr(self, x)
                if val is not inherit:
                    c[x.replace('_', '-')] = val
            if self.indent not in (inherit, 'auto') and self.margin_left != 'auto':
                c['margin-left'] = self.indent
            if self.float is not inherit:
                for x in ('left', 'top', 'right', 'bottom'):
                    val = self.float.get('%sFromText' % x, 0)
                    try:
                        val = '%.3gpt' % (int(val) / 20)
                    except (ValueError, TypeError):
                        val = '0'
                    c['margin-%s' % x] = val
                if 'tblpXSpec' in self.float:
                    c['float'] = 'right' if self.float['tblpXSpec'] in {'right', 'outside'} else 'left'
                else:
                    page = self.page
                    page_width = page.width - page.margin_left - page.margin_right
                    try:
                        x = int(self.float['tblpX']) / 20
                    except (KeyError, ValueError, TypeError):
                        x = 0
                    c['float'] = 'left' if (x/page_width) < 0.65 else 'right'
            c.update(self.convert_spacing())
            if 'border-collapse' not in c:
                c['border-collapse'] = 'collapse'
            c.update(self.convert_border())
        return self._css
 class Table(object):
    def __init__(self, namespace, tbl, styles, para_map, is_sub_table=False):
        self.namespace = namespace
        self.tbl = tbl
        self.styles = styles
        self.is_sub_table = is_sub_table
        # Read Table Style
        style = {'table':TableStyle(self.namespace)}
        for tblPr in self.namespace.XPath('./w:tblPr')(tbl):
            for ts in self.namespace.XPath('./w:tblStyle[@w:val]')(tblPr):
                style_id = self.namespace.get(ts, 'w:val')
                s = styles.get(style_id)
                if s is not None:
                    if s.table_style is not None:
                        style['table'].update(s.table_style)
                    if s.paragraph_style is not None:
                        if 'paragraph' in style:
                            style['paragraph'].update(s.paragraph_style)
                        else:
                            style['paragraph'] = s.paragraph_style
                    if s.character_style is not None:
                        if 'run' in style:
                            style['run'].update(s.character_style)
                        else:
                            style['run'] = s.character_style
            style['table'].update(TableStyle(self.namespace, tblPr))
        self.table_style, self.paragraph_style = style['table'], style.get('paragraph', None)
        self.run_style = style.get('run', None)
        self.overrides = self.table_style.overrides
        if self.overrides is inherit:
            self.overrides = {}
        if 'wholeTable' in self.overrides and 'table' in self.overrides['wholeTable']:
            self.table_style.update(self.overrides['wholeTable']['table'])
        self.style_map = {}
        self.paragraphs = []
        self.cell_map = []
        rows = self.namespace.XPath('./w:tr')(tbl)
        for r, tr in enumerate(rows):
            overrides = self.get_overrides(r, None, len(rows), None)
            self.resolve_row_style(tr, overrides)
            cells = self.namespace.XPath('./w:tc')(tr)
            self.cell_map.append([])
            for c, tc in enumerate(cells):
                overrides = self.get_overrides(r, c, len(rows), len(cells))
                self.resolve_cell_style(tc, overrides, r, c, len(rows), len(cells))
                self.cell_map[-1].append(tc)
                for p in self.namespace.XPath('./w:p')(tc):
                    para_map[p] = self
                    self.paragraphs.append(p)
                    self.resolve_para_style(p, overrides)
        self.handle_merged_cells()
        self.sub_tables = {x:Table(namespace, x, styles, para_map, is_sub_table=True) for x in self.namespace.XPath('./w:tr/w:tc/w:tbl')(tbl)}
    @property
    def bidi(self):
        return self.table_style.bidi is True
    def override_allowed(self, name):
        'Check if the named override is allowed by the tblLook element'
        if name.endswith('Cell') or name == 'wholeTable':
            return True
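        # tblLook bit flags: 0x0020 first row, 0x0040 last row, 0x0080 first
        # column, 0x0100 last column, 0x0200 no horizontal banding, 0x0400 no
        # vertical banding
        look = self.table_style.look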
        if (look & 0x0020 and name == 'firstRow') or (look & 0x0040 and name == 'lastRow') or \
           (look & 0x0080 and name == 'firstCol') or (look & 0x0100 and name == 'lastCol'):
            return True
        if name.startswith('band'):
            if name.endswith('Horz'):
                return not bool(look & 0x0200)
            if name.endswith('Vert'):
                return not bool(look & 0x0400)
        return False
    def get_overrides(self, r, c, num_of_rows, num_of_cols_in_row):
        'List of allowed overrides for row r (when c is None) or for the cell at row r, column c'
        overrides = ['wholeTable']
        def divisor(m, n):
            # Integer quotient of m by n, i.e. the zero-based band index
            return (m - (m % n)) // n
        if c is not None:
            odd_column_band = (divisor(c, self.table_style.col_band_size) % 2) == 1
            overrides.append('band%dVert' % (1 if odd_column_band else 2))
        odd_row_band = (divisor(r, self.table_style.row_band_size) % 2) == 1
        overrides.append('band%dHorz' % (1 if odd_row_band else 2))
        # According to the OOXML spec columns should have higher override
        # priority than rows, but Word seems to do it the other way around.
        if c is not None:
            if c == 0:
                overrides.append('firstCol')
            if c >= num_of_cols_in_row - 1:
                overrides.append('lastCol')
        if r == 0:
            overrides.append('firstRow')
        if r >= num_of_rows - 1:
            overrides.append('lastRow')
        if c is not None:
            if r == 0:
                if c == 0:
                    overrides.append('nwCell')
                if c == num_of_cols_in_row - 1:
                    overrides.append('neCell')
            if r == num_of_rows - 1:
                if c == 0:
                    overrides.append('swCell')
                if c == num_of_cols_in_row - 1:
                    overrides.append('seCell')
        return tuple(filter(self.override_allowed, overrides))
    def resolve_row_style(self, tr, overrides):
        rs = RowStyle(self.namespace)
        for o in overrides:
            if o in self.overrides:
                ovr = self.overrides[o]
                ors = ovr.get('row', None)
                if ors is not None:
                    rs.update(ors)
        for trPr in self.namespace.XPath('./w:trPr')(tr):
            rs.update(RowStyle(self.namespace, trPr))
        if self.bidi:
            rs.apply_bidi()
        self.style_map[tr] = rs
    def resolve_cell_style(self, tc, overrides, row, col, rows, cols_in_row):
        cs = CellStyle(self.namespace)
        for o in overrides:
            if o in self.overrides:
                ovr = self.overrides[o]
                ors = ovr.get('cell', None)
                if ors is not None:
                    cs.update(ors)
        for tcPr in self.namespace.XPath('./w:tcPr')(tc):
            cs.update(CellStyle(self.namespace, tcPr))
        for x in edges:
            p = 'cell_padding_%s' % x
            val = getattr(cs, p)
            if val is inherit:
                setattr(cs, p, getattr(self.table_style, p))
            is_inside_edge = (
                (x == 'left' and col > 0) or
                (x == 'top' and row > 0) or
                (x == 'right' and col < cols_in_row - 1) or
                (x == 'bottom' and row < rows -1)
            )
            inside_edge = ('insideH' if x in {'top', 'bottom'} else 'insideV') if is_inside_edge else None
            for prop in border_props:
                if not prop.startswith('border'):
                    continue
                eprop = prop % x
                iprop = (prop % inside_edge) if inside_edge else None
                val = getattr(cs, eprop)
                if val is inherit and iprop is not None:
                    # Use the insideX borders if the main cell borders are not
                    # specified
                    val = getattr(cs, iprop)
                    if val is inherit:
                        val = getattr(self.table_style, iprop)
                if not is_inside_edge and val == 'none':
                    # Cell borders must override table borders even when the
                    # table border is not null and the cell border is null.
                    val = 'hidden'
                setattr(cs, eprop, val)
        if self.bidi:
            cs.apply_bidi()
        self.style_map[tc] = cs
    def resolve_para_style(self, p, overrides):
        text_styles = [clone(self.paragraph_style), clone(self.run_style)]
        for o in overrides:
            if o in self.overrides:
                ovr = self.overrides[o]
                for i, name in enumerate(('para', 'run')):
                    ops = ovr.get(name, None)
                    if ops is not None:
                        if text_styles[i] is None:
                            text_styles[i] = ops
                        else:
                            text_styles[i].update(ops)
        self.style_map[p] = text_styles
    def handle_merged_cells(self):
        if not self.cell_map:
            return
        # Handle vMerge
        max_col_num = max(len(r) for r in self.cell_map)
        for c in range(max_col_num):
            cells = [row[c] if c < len(row) else None for row in self.cell_map]
            runs = [[]]
            for cell in cells:
                try:
                    s = self.style_map[cell]
                except KeyError:  # cell is None
                    s = CellStyle(self.namespace)
                if s.vMerge == 'restart':
                    runs.append([cell])
                elif s.vMerge == 'continue':
                    runs[-1].append(cell)
                else:
                    runs.append([])
            for run in runs:
                if len(run) > 1:
                    self.style_map[run[0]].row_span = len(run)
                    for tc in run[1:]:
                        tc.getparent().remove(tc)
        # Handle hMerge
        for cells in self.cell_map:
            runs = [[]]
            for cell in cells:
                try:
                    s = self.style_map[cell]
                except KeyError:  # cell is None
                    s = CellStyle(self.namespace)
                if s.col_span is not inherit:
                    runs.append([])
                    continue
                if s.hMerge == 'restart':
                    runs.append([cell])
                elif s.hMerge == 'continue':
                    runs[-1].append(cell)
                else:
                    runs.append([])
            for run in runs:
                if len(run) > 1:
                    self.style_map[run[0]].col_span = len(run)
                    for tc in run[1:]:
                        tc.getparent().remove(tc)
    def __iter__(self):
        for p in self.paragraphs:
            yield p
        for t in itervalues(self.sub_tables):
            for p in t:
                yield p
    def apply_markup(self, rmap, page, parent=None):
        table = TABLE('\n\t\t')
        if self.bidi:
            table.set('dir', 'rtl')
        self.table_style.page = page
        style_map = {}
        if parent is None:
            try:
                first_para = rmap[next(iter(self))]
            except StopIteration:
                return
            parent = first_para.getparent()
            idx = parent.index(first_para)
            parent.insert(idx, table)
        else:
            parent.append(table)
        for row in self.namespace.XPath('./w:tr')(self.tbl):
            tr = TR('\n\t\t\t')
            style_map[tr] = self.style_map[row]
            tr.tail = '\n\t\t'
            table.append(tr)
            for tc in self.namespace.XPath('./w:tc')(row):
                td = TD()
                style_map[td] = s = self.style_map[tc]
                if s.col_span is not inherit:
                    td.set('colspan', unicode_type(s.col_span))
                if s.row_span is not inherit:
                    td.set('rowspan', unicode_type(s.row_span))
                td.tail = '\n\t\t\t'
                tr.append(td)
                for x in self.namespace.XPath('./w:p|./w:tbl')(tc):
                    if x.tag.endswith('}p'):
                        td.append(rmap[x])
                    else:
                        self.sub_tables[x].apply_markup(rmap, page, parent=td)
            if len(tr):
                tr[-1].tail = '\n\t\t'
        if len(table):
            table[-1].tail = '\n\t'
        table_style = self.table_style.css
        if table_style:
            table.set('class', self.styles.register(table_style, 'table'))
        for elem, style in iteritems(style_map):
            css = style.css
            if css:
                elem.set('class', self.styles.register(css, elem.tag))
 class Tables(object):
    def __init__(self, namespace):
        self.tables = []
        self.para_map = {}
        self.sub_tables = set()
        self.namespace = namespace
    def register(self, tbl, styles):
        if tbl in self.sub_tables:
            return
        self.tables.append(Table(self.namespace, tbl, styles, self.para_map))
        self.sub_tables |= set(self.tables[-1].sub_tables)
    def apply_markup(self, object_map, page_map):
        rmap = {v:k for k, v in iteritems(object_map)}
        for table in self.tables:
            table.apply_markup(rmap, page_map[table.tbl])
    def para_style(self, p):
        table = self.para_map.get(p, None)
        if table is not None:
            return table.style_map.get(p, (None, None))[0]
    def run_style(self, p):
        table = self.para_map.get(p, None)
        if table is not None:
            return table.style_map.get(p, (None, None))[1]
--- a/ebook_converter/ebooks/docx/theme.py
+++ b/ebook_converter/ebooks/docx/theme.py
@@ -0,0 +1,29 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 class Theme(object):
    def __init__(self, namespace):
        self.major_latin_font = 'Cambria'
        self.minor_latin_font = 'Calibri'
        self.namespace = namespace
    def __call__(self, root):
        for fs in self.namespace.XPath('//a:fontScheme')(root):
            for mj in self.namespace.XPath('./a:majorFont')(fs):
                for l in self.namespace.XPath('./a:latin[@typeface]')(mj):
                    self.major_latin_font = l.get('typeface')
            for mj in self.namespace.XPath('./a:minorFont')(fs):
                for l in self.namespace.XPath('./a:latin[@typeface]')(mj):
                    self.minor_latin_font = l.get('typeface')
    def resolve_font_family(self, ff):
        if ff.startswith('|'):
            ff = ff[1:-1]
            ff = self.major_latin_font if ff.startswith('major') else self.minor_latin_font
        return ff
--- a/ebook_converter/ebooks/docx/to_html.py
+++ b/ebook_converter/ebooks/docx/to_html.py
@@ -0,0 +1,839 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 import sys, os, re, math, errno, uuid, numbers
 from collections import OrderedDict, defaultdict
 from lxml import html
 from lxml.html.builder import (
    HTML, HEAD, TITLE, BODY, LINK, META, P, SPAN, BR, DIV, A, DT, DL, DD, H1)
 from calibre import guess_type
 from calibre.ebooks.docx.container import DOCX, fromstring
 from calibre.ebooks.docx.names import XML, generate_anchor
 from calibre.ebooks.docx.styles import Styles, inherit, PageProperties
 from calibre.ebooks.docx.numbering import Numbering
 from calibre.ebooks.docx.fonts import Fonts, is_symbol_font, map_symbol_text
 from calibre.ebooks.docx.images import Images
 from calibre.ebooks.docx.tables import Tables
 from calibre.ebooks.docx.footnotes import Footnotes
 from calibre.ebooks.docx.cleanup import cleanup_markup
 from calibre.ebooks.docx.theme import Theme
 from calibre.ebooks.docx.toc import create_toc
 from calibre.ebooks.docx.fields import Fields
 from calibre.ebooks.docx.settings import Settings
 from calibre.ebooks.metadata.opf2 import OPFCreator
 from calibre.utils.localization import canonicalize_lang, lang_as_iso639_1
 from polyglot.builtins import iteritems, itervalues, filter, getcwd, map, unicode_type
 NBSP = '\xa0'
 class Text:
    def __init__(self, elem, attr, buf):
        self.elem, self.attr, self.buf = elem, attr, buf
        self.elems = [self.elem]
    def add_elem(self, elem):
        self.elems.append(elem)
        setattr(self.elem, self.attr, ''.join(self.buf))
        self.elem, self.attr, self.buf = elem, 'tail', []
    def __iter__(self):
        return iter(self.elems)
 def html_lang(docx_lang):
    lang = canonicalize_lang(docx_lang)
    if lang and lang != 'und':
        lang = lang_as_iso639_1(lang)
        if lang:
            return lang
 class Convert(object):
    def __init__(self, path_or_stream, dest_dir=None, log=None, detect_cover=True, notes_text=None, notes_nopb=False, nosupsub=False):
        self.docx = DOCX(path_or_stream, log=log)
        self.namespace = self.docx.namespace
        self.ms_pat = re.compile(r'\s{2,}')
        self.ws_pat = re.compile(r'[\n\r\t]')
        self.log = self.docx.log
        self.detect_cover = detect_cover
        self.notes_text = notes_text or _('Notes')
        self.notes_nopb = notes_nopb
        self.nosupsub = nosupsub
        self.dest_dir = dest_dir or getcwd()
        self.mi = self.docx.metadata
        self.body = BODY()
        self.theme = Theme(self.namespace)
        self.settings = Settings(self.namespace)
        self.tables = Tables(self.namespace)
        self.fields = Fields(self.namespace)
        self.styles = Styles(self.namespace, self.tables)
        self.images = Images(self.namespace, self.log)
        self.object_map = OrderedDict()
        self.html = HTML(
            HEAD(
                META(charset='utf-8'),
                TITLE(self.mi.title or _('Unknown')),
                LINK(rel='stylesheet', type='text/css', href='docx.css'),
            ),
            self.body
        )
        self.html.text='\n\t'
        self.html[0].text='\n\t\t'
        self.html[0].tail='\n'
        for child in self.html[0]:
            child.tail = '\n\t\t'
        self.html[0][-1].tail = '\n\t'
        self.html[1].text = self.html[1].tail = '\n'
        lang = html_lang(self.mi.language)
        if lang:
            self.html.set('lang', lang)
            self.doc_lang = lang
        else:
            self.doc_lang = None
    def __call__(self):
        doc = self.docx.document
        relationships_by_id, relationships_by_type = self.docx.document_relationships
        self.resolve_alternate_content(doc)
        self.fields(doc, self.log)
        self.read_styles(relationships_by_type)
        self.images(relationships_by_id)
        self.layers = OrderedDict()
        self.framed = [[]]
        self.frame_map = {}
        self.framed_map = {}
        self.anchor_map = {}
        self.link_map = defaultdict(list)
        self.link_source_map = {}
        self.toc_anchor = None
        self.block_runs = []
        paras = []
        self.log.debug('Converting Word markup to HTML')
        self.read_page_properties(doc)
        self.current_rels = relationships_by_id
        for wp, page_properties in iteritems(self.page_map):
            self.current_page = page_properties
            if wp.tag.endswith('}p'):
                p = self.convert_p(wp)
                self.body.append(p)
                paras.append(wp)
        self.read_block_anchors(doc)
        self.styles.apply_contextual_spacing(paras)
        self.mark_block_runs(paras)
        # Apply page breaks at the start of every section, except the first
        # section (since that will be the start of the file)
        self.styles.apply_section_page_breaks(self.section_starts[1:])
        notes_header = None
        orig_rid_map = self.images.rid_map
        if self.footnotes.has_notes:
            self.body.append(H1(self.notes_text))
            notes_header = self.body[-1]
            notes_header.set('class', 'notes-header')
            for anchor, text, note in self.footnotes:
                dl = DL(id=anchor)
                dl.set('class', 'footnote')
                self.body.append(dl)
                dl.append(DT('[', A('←' + text, href='#back_%s' % anchor, title=text)))
                dl[-1][0].tail = ']'
                dl.append(DD())
                paras = []
                self.images.rid_map = self.current_rels = note.rels[0]
                for wp in note:
                    if wp.tag.endswith('}tbl'):
                        self.tables.register(wp, self.styles)
                        self.page_map[wp] = self.current_page
                    else:
                        p = self.convert_p(wp)
                        dl[-1].append(p)
                        paras.append(wp)
                self.styles.apply_contextual_spacing(paras)
                self.mark_block_runs(paras)
        for p, wp in iteritems(self.object_map):
            if len(p) > 0 and not p.text and len(p[0]) > 0 and not p[0].text and p[0][0].get('class', None) == 'tab':
                # Paragraph uses tabs for indentation, convert to text-indent
                parent = p[0]
                tabs = []
                for child in parent:
                    if child.get('class', None) == 'tab':
                        tabs.append(child)
                        if child.tail:
                            break
                    else:
                        break
                indent = len(tabs) * self.settings.default_tab_stop
                style = self.styles.resolve(wp)
                if style.text_indent is inherit or (hasattr(style.text_indent, 'endswith') and style.text_indent.endswith('pt')):
                    if style.text_indent is not inherit:
                        indent = float(style.text_indent[:-2]) + indent
                    style.text_indent = '%.3gpt' % indent
                    parent.text = tabs[-1].tail or ''
                    list(map(parent.remove, tabs))
        self.images.rid_map = orig_rid_map
        self.resolve_links()
        self.styles.cascade(self.layers)
        self.tables.apply_markup(self.object_map, self.page_map)
        numbered = []
        for html_obj, obj in iteritems(self.object_map):
            raw = obj.get('calibre_num_id', None)
            if raw is not None:
                lvl, num_id = raw.partition(':')[0::2]
                try:
                    lvl = int(lvl)
                except (TypeError, ValueError):
                    lvl = 0
                numbered.append((html_obj, num_id, lvl))
        self.numbering.apply_markup(numbered, self.body, self.styles, self.object_map, self.images)
        self.apply_frames()
        if len(self.body) > 0:
            self.body.text = '\n\t'
            for child in self.body:
                child.tail = '\n\t'
            self.body[-1].tail = '\n'
        self.log.debug('Converting styles to CSS')
        self.styles.generate_classes()
        for html_obj, obj in iteritems(self.object_map):
            style = self.styles.resolve(obj)
            if style is not None:
                css = style.css
                if css:
                    cls = self.styles.class_name(css)
                    if cls:
                        html_obj.set('class', cls)
        for html_obj, css in iteritems(self.framed_map):
            cls = self.styles.class_name(css)
            if cls:
                html_obj.set('class', cls)
        if notes_header is not None:
            for h in self.namespace.children(self.body, 'h1', 'h2', 'h3'):
                notes_header.tag = h.tag
                cls = h.get('class', None)
                if cls and cls != 'notes-header':
                    notes_header.set('class', '%s notes-header' % cls)
                break
        self.fields.polish_markup(self.object_map)
        self.log.debug('Cleaning up redundant markup generated by Word')
        self.cover_image = cleanup_markup(self.log, self.html, self.styles, self.dest_dir, self.detect_cover, self.namespace.XPath)
        return self.write(doc)
    def read_page_properties(self, doc):
        current = []
        self.page_map = OrderedDict()
        self.section_starts = []
        for p in self.namespace.descendants(doc, 'w:p', 'w:tbl'):
            if p.tag.endswith('}tbl'):
                self.tables.register(p, self.styles)
                current.append(p)
                continue
            sect = tuple(self.namespace.descendants(p, 'w:sectPr'))
            if sect:
                pr = PageProperties(self.namespace, sect)
                paras = current + [p]
                for x in paras:
                    self.page_map[x] = pr
                self.section_starts.append(paras[0])
                current = []
            else:
                current.append(p)
        if current:
            self.section_starts.append(current[0])
            last = self.namespace.XPath('./w:body/w:sectPr')(doc)
            pr = PageProperties(self.namespace, last)
            for x in current:
                self.page_map[x] = pr
    def resolve_alternate_content(self, doc):
        # For proprietary extensions in Word documents, use the fallback,
        # spec-compliant form
        # See https://wiki.openoffice.org/wiki/OOXML/Markup_Compatibility_and_Extensibility
        for ac in self.namespace.descendants(doc, 'mc:AlternateContent'):
            choices = self.namespace.XPath('./mc:Choice')(ac)
            fallbacks = self.namespace.XPath('./mc:Fallback')(ac)
            if fallbacks:
                for choice in choices:
                    ac.remove(choice)
    def read_styles(self, relationships_by_type):
        def get_name(rtype, defname):
            name = relationships_by_type.get(rtype, None)
            if name is None:
                cname = self.docx.document_name.split('/')
                cname[-1] = defname
                if self.docx.exists('/'.join(cname)):
                    name = '/'.join(cname)
            if name and name.startswith('word/word') and not self.docx.exists(name):
                name = name.partition('/')[2]
            return name
        nname = get_name(self.namespace.names['NUMBERING'], 'numbering.xml')
        sname = get_name(self.namespace.names['STYLES'], 'styles.xml')
        sename = get_name(self.namespace.names['SETTINGS'], 'settings.xml')
        fname = get_name(self.namespace.names['FONTS'], 'fontTable.xml')
        tname = get_name(self.namespace.names['THEMES'], 'theme1.xml')
        foname = get_name(self.namespace.names['FOOTNOTES'], 'footnotes.xml')
        enname = get_name(self.namespace.names['ENDNOTES'], 'endnotes.xml')
        numbering = self.numbering = Numbering(self.namespace)
        footnotes = self.footnotes = Footnotes(self.namespace)
        fonts = self.fonts = Fonts(self.namespace)
        foraw = enraw = None
        forel, enrel = ({}, {}), ({}, {})
        if sename is not None:
            try:
                seraw = self.docx.read(sename)
            except KeyError:
                self.log.warn('Settings %s do not exist' % sename)
            except EnvironmentError as e:
                if e.errno != errno.ENOENT:
                    raise
                self.log.warn('Settings %s file missing' % sename)
            else:
                self.settings(fromstring(seraw))
        if foname is not None:
            try:
                foraw = self.docx.read(foname)
            except KeyError:
                self.log.warn('Footnotes %s do not exist' % foname)
            else:
                forel = self.docx.get_relationships(foname)
        if enname is not None:
            try:
                enraw = self.docx.read(enname)
            except KeyError:
                self.log.warn('Endnotes %s do not exist' % enname)
            else:
                enrel = self.docx.get_relationships(enname)
        footnotes(fromstring(foraw) if foraw else None, forel, fromstring(enraw) if enraw else None, enrel)
        if fname is not None:
            embed_relationships = self.docx.get_relationships(fname)[0]
            try:
                raw = self.docx.read(fname)
            except KeyError:
                self.log.warn('Fonts table %s does not exist' % fname)
            else:
                fonts(fromstring(raw), embed_relationships, self.docx, self.dest_dir)
        if tname is not None:
            try:
                raw = self.docx.read(tname)
            except KeyError:
                self.log.warn('Theme %s does not exist' % tname)
            else:
                self.theme(fromstring(raw))
        styles_loaded = False
        if sname is not None:
            try:
                raw = self.docx.read(sname)
            except KeyError:
                self.log.warn('Styles %s do not exist' % sname)
            else:
                self.styles(fromstring(raw), fonts, self.theme)
                styles_loaded = True
        if not styles_loaded:
            self.styles(None, fonts, self.theme)
        if nname is not None:
            try:
                raw = self.docx.read(nname)
            except KeyError:
                self.log.warn('Numbering styles %s do not exist' % nname)
            else:
                numbering(fromstring(raw), self.styles, self.docx.get_relationships(nname)[0])
        self.styles.resolve_numbering(numbering)
    def write(self, doc):
        toc = create_toc(doc, self.body, self.resolved_link_map, self.styles, self.object_map, self.log, self.namespace)
        raw = html.tostring(self.html, encoding='utf-8', doctype='<!DOCTYPE html>')
        with lopen(os.path.join(self.dest_dir, 'index.html'), 'wb') as f:
            f.write(raw)
        css = self.styles.generate_css(self.dest_dir, self.docx, self.notes_nopb, self.nosupsub)
        if css:
            with lopen(os.path.join(self.dest_dir, 'docx.css'), 'wb') as f:
                f.write(css.encode('utf-8'))
        opf = OPFCreator(self.dest_dir, self.mi)
        opf.toc = toc
        opf.create_manifest_from_files_in([self.dest_dir])
        for item in opf.manifest:
            if item.media_type == 'text/html':
                item.media_type = guess_type('a.xhtml')[0]
        opf.create_spine(['index.html'])
        if self.cover_image is not None:
            opf.guide.set_cover(self.cover_image)
        def process_guide(E, guide):
            if self.toc_anchor is not None:
                guide.append(E.reference(
                    href='index.html#' + self.toc_anchor, title=_('Table of Contents'), type='toc'))
        toc_file = os.path.join(self.dest_dir, 'toc.ncx')
        with lopen(os.path.join(self.dest_dir, 'metadata.opf'), 'wb') as of, open(toc_file, 'wb') as ncx:
            opf.render(of, ncx, 'toc.ncx', process_guide=process_guide)
        if os.path.getsize(toc_file) == 0:
            os.remove(toc_file)
        return os.path.join(self.dest_dir, 'metadata.opf')
    def read_block_anchors(self, doc):
        doc_anchors = frozenset(self.namespace.XPath('./w:body/w:bookmarkStart[@w:name]')(doc))
        if doc_anchors:
            current_bm = set()
            rmap = {v:k for k, v in iteritems(self.object_map)}
            for p in self.namespace.descendants(doc, 'w:p', 'w:bookmarkStart[@w:name]'):
                if p.tag.endswith('}p'):
                    if current_bm and p in rmap:
                        para = rmap[p]
                        if 'id' not in para.attrib:
                            para.set('id', generate_anchor(next(iter(current_bm)), frozenset(itervalues(self.anchor_map))))
                        for name in current_bm:
                            self.anchor_map[name] = para.get('id')
                        current_bm = set()
                elif p in doc_anchors:
                    anchor = self.namespace.get(p, 'w:name')
                    if anchor:
                        current_bm.add(anchor)
    def convert_p(self, p):
        dest = P()
        self.object_map[dest] = p
        style = self.styles.resolve_paragraph(p)
        self.layers[p] = []
        self.frame_map[p] = style.frame
        self.add_frame(dest, style.frame)
        current_anchor = None
        current_hyperlink = None
        hl_xpath = self.namespace.XPath('ancestor::w:hyperlink[1]')
        def p_parent(x):
            # Ensure that nested <w:p> tags are handled. These can occur if a
            # textbox is present inside a paragraph.
            while True:
                x = x.getparent()
                try:
                    if x.tag.endswith('}p'):
                        return x
                except AttributeError:
                    break
        for x in self.namespace.descendants(p, 'w:r', 'w:bookmarkStart', 'w:hyperlink', 'w:instrText'):
            if p_parent(x) is not p:
                continue
            if x.tag.endswith('}r'):
                span = self.convert_run(x)
                if current_anchor is not None:
                    (dest if len(dest) == 0 else span).set('id', current_anchor)
                    current_anchor = None
                if current_hyperlink is not None:
                    try:
                        hl = hl_xpath(x)[0]
                        self.link_map[hl].append(span)
                        self.link_source_map[hl] = self.current_rels
                        x.set('is-link', '1')
                    except IndexError:
                        current_hyperlink = None
                dest.append(span)
                self.layers[p].append(x)
            elif x.tag.endswith('}bookmarkStart'):
                anchor = self.namespace.get(x, 'w:name')
                if anchor and anchor not in self.anchor_map and anchor != '_GoBack':
                    # _GoBack is a special bookmark inserted by Word 2010 for
                    # the return-to-previous-edit feature; we ignore it
                    old_anchor = current_anchor
                    self.anchor_map[anchor] = current_anchor = generate_anchor(anchor, frozenset(itervalues(self.anchor_map)))
                    if old_anchor is not None:
                        # The previous anchor was not applied to any element
                        for a, t in tuple(iteritems(self.anchor_map)):
                            if t == old_anchor:
                                self.anchor_map[a] = current_anchor
            elif x.tag.endswith('}hyperlink'):
                current_hyperlink = x
            elif x.tag.endswith('}instrText') and x.text and x.text.strip().startswith('TOC '):
                old_anchor = current_anchor
                anchor = unicode_type(uuid.uuid4())
                self.anchor_map[anchor] = current_anchor = generate_anchor('toc', frozenset(itervalues(self.anchor_map)))
                self.toc_anchor = current_anchor
                if old_anchor is not None:
                    # The previous anchor was not applied to any element
                    for a, t in tuple(iteritems(self.anchor_map)):
                        if t == old_anchor:
                            self.anchor_map[a] = current_anchor
        if current_anchor is not None:
            # This paragraph had no <w:r> descendants
            dest.set('id', current_anchor)
            current_anchor = None
        m = re.match(r'heading\s+(\d+)$', style.style_name or '', re.IGNORECASE)
        if m is not None:
            n = min(6, max(1, int(m.group(1))))
            dest.tag = 'h%d' % n
            dest.set('data-heading-level', unicode_type(n))
        if style.bidi is True:
            dest.set('dir', 'rtl')
        border_runs = []
        common_borders = []
        for span in dest:
            run = self.object_map[span]
            style = self.styles.resolve_run(run)
            if not border_runs or border_runs[-1][1].same_border(style):
                border_runs.append((span, style))
            elif border_runs:
                if len(border_runs) > 1:
                    common_borders.append(border_runs)
                border_runs = []
        for border_run in common_borders:
            spans = []
            bs = {}
            for span, style in border_run:
                style.get_border_css(bs)
                style.clear_border_css()
                spans.append(span)
            if bs:
                cls = self.styles.register(bs, 'text_border')
                wrapper = self.wrap_elems(spans, SPAN())
                wrapper.set('class', cls)
        if not dest.text and len(dest) == 0 and not style.has_visible_border():
            # Empty paragraph: add a non-breaking space so that it is rendered
            # by WebKit
            dest.text = NBSP
        # If the last element in a block is a <br>, the <br> is not rendered in
        # HTML unless it is followed by a trailing space. Word, on the other
        # hand, inserts a blank line for trailing <br>s.
        if len(dest) > 0 and not dest[-1].tail:
            if dest[-1].tag == 'br':
                dest[-1].tail = NBSP
            elif len(dest[-1]) > 0 and dest[-1][-1].tag == 'br' and not dest[-1][-1].tail:
                dest[-1][-1].tail = NBSP
        return dest
    def wrap_elems(self, elems, wrapper):
        p = elems[0].getparent()
        idx = p.index(elems[0])
        p.insert(idx, wrapper)
        wrapper.tail = elems[-1].tail
        elems[-1].tail = None
        for elem in elems:
            try:
                p.remove(elem)
            except ValueError:
                # Probably a hyperlink that spans multiple paragraphs.
                # Theoretically we should break this up into multiple
                # hyperlinks, but I can't be bothered.
                elem.getparent().remove(elem)
            wrapper.append(elem)
        return wrapper
    def resolve_links(self):
        self.resolved_link_map = {}
        for hyperlink, spans in iteritems(self.link_map):
            relationships_by_id = self.link_source_map[hyperlink]
            span = spans[0]
            if len(spans) > 1:
                span = self.wrap_elems(spans, SPAN())
            span.tag = 'a'
            self.resolved_link_map[hyperlink] = span
            tgt = self.namespace.get(hyperlink, 'w:tgtFrame')
            if tgt:
                span.set('target', tgt)
            tt = self.namespace.get(hyperlink, 'w:tooltip')
            if tt:
                span.set('title', tt)
            rid = self.namespace.get(hyperlink, 'r:id')
            if rid and rid in relationships_by_id:
                span.set('href', relationships_by_id[rid])
                continue
            anchor = self.namespace.get(hyperlink, 'w:anchor')
            if anchor and anchor in self.anchor_map:
                span.set('href', '#' + self.anchor_map[anchor])
                continue
            self.log.warn('Hyperlink with unknown target (rid=%s, anchor=%s), ignoring' %
                          (rid, anchor))
            # hrefs that point nowhere give epubcheck a hernia. The element
            # should be styled explicitly by Word anyway.
            # span.set('href', '#')
        rmap = {v:k for k, v in iteritems(self.object_map)}
        for hyperlink, runs in self.fields.hyperlink_fields:
            spans = [rmap[r] for r in runs if r in rmap]
            if not spans:
                continue
            span = spans[0]
            if len(spans) > 1:
                span = self.wrap_elems(spans, SPAN())
            span.tag = 'a'
            tgt = hyperlink.get('target', None)
            if tgt:
                span.set('target', tgt)
            tt = hyperlink.get('title', None)
            if tt:
                span.set('title', tt)
            url = hyperlink.get('url', None)
            if url is None:
                anchor = hyperlink.get('anchor', None)
                if anchor in self.anchor_map:
                    span.set('href', '#' + self.anchor_map[anchor])
                    continue
                self.log.warn('Hyperlink field with unknown anchor: %s' % anchor)
            else:
                if url in self.anchor_map:
                    span.set('href', '#' + self.anchor_map[url])
                    continue
                span.set('href', url)
        for img, link, relationships_by_id in self.images.links:
            parent = img.getparent()
            idx = parent.index(img)
            a = A(img)
            a.tail, img.tail = img.tail, None
            parent.insert(idx, a)
            tgt = link.get('target', None)
            if tgt:
                a.set('target', tgt)
            tt = link.get('title', None)
            if tt:
                a.set('title', tt)
            rid = link['id']
            if rid in relationships_by_id:
                dest = relationships_by_id[rid]
                if dest.startswith('#'):
                    if dest[1:] in self.anchor_map:
                        a.set('href', '#' + self.anchor_map[dest[1:]])
                else:
                    a.set('href', dest)
    def convert_run(self, run):
        ans = SPAN()
        self.object_map[ans] = run
        text = Text(ans, 'text', [])
        for child in run:
            if self.namespace.is_tag(child, 'w:t'):
                if not child.text:
                    continue
                space = child.get(XML('space'), None)
                preserve = False
                ctext = child.text
                if space != 'preserve':
                    # Remove leading and trailing whitespace. Word ignores
                    # leading and trailing whitespace without preserve
                    ctext = ctext.strip(' \n\r\t')
                # Only use a <span> with white-space:pre-wrap if this element
                # actually needs it, i.e. if it has more than one
                # consecutive space or it has newlines or tabs.
                multi_spaces = self.ms_pat.search(ctext) is not None
                preserve = multi_spaces or self.ws_pat.search(ctext) is not None
                if preserve:
                    text.add_elem(SPAN(ctext, style="white-space:pre-wrap"))
                    ans.append(text.elem)
                else:
                    text.buf.append(ctext)
            elif self.namespace.is_tag(child, 'w:cr'):
                text.add_elem(BR())
                ans.append(text.elem)
            elif self.namespace.is_tag(child, 'w:br'):
                typ = self.namespace.get(child, 'w:type')
                if typ in {'column', 'page'}:
                    br = BR(style='page-break-after:always')
                else:
                    clear = child.get('clear', None)
                    if clear in {'all', 'left', 'right'}:
                        br = BR(style='clear:%s'%('both' if clear == 'all' else clear))
                    else:
                        br = BR()
                text.add_elem(br)
                ans.append(text.elem)
            elif self.namespace.is_tag(child, 'w:drawing') or self.namespace.is_tag(child, 'w:pict'):
                for img in self.images.to_html(child, self.current_page, self.docx, self.dest_dir):
                    text.add_elem(img)
                    ans.append(text.elem)
            elif self.namespace.is_tag(child, 'w:footnoteReference') or self.namespace.is_tag(child, 'w:endnoteReference'):
                anchor, name = self.footnotes.get_ref(child)
                if anchor and name:
                    l = A(name, id='back_%s' % anchor, href='#' + anchor, title=name)
                    l.set('class', 'noteref')
                    text.add_elem(l)
                    ans.append(text.elem)
            elif self.namespace.is_tag(child, 'w:tab'):
                spaces = int(math.ceil((self.settings.default_tab_stop / 36) * 6))
                text.add_elem(SPAN(NBSP * spaces))
                ans.append(text.elem)
                ans[-1].set('class', 'tab')
            elif self.namespace.is_tag(child, 'w:noBreakHyphen'):
                text.buf.append('\u2011')
            elif self.namespace.is_tag(child, 'w:softHyphen'):
                text.buf.append('\u00ad')
        if text.buf:
            setattr(text.elem, text.attr, ''.join(text.buf))
        style = self.styles.resolve_run(run)
        if style.vert_align in {'superscript', 'subscript'}:
            if ans.text or len(ans):
                ans.set('data-docx-vert', 'sup' if style.vert_align == 'superscript' else 'sub')
        if style.lang is not inherit:
            lang = html_lang(style.lang)
            if lang is not None and lang != self.doc_lang:
                ans.set('lang', lang)
        if style.rtl is True:
            ans.set('dir', 'rtl')
        if is_symbol_font(style.font_family):
            for elem in text:
                if elem.text:
                    elem.text = map_symbol_text(elem.text, style.font_family)
                if elem.tail:
                    elem.tail = map_symbol_text(elem.tail, style.font_family)
            style.font_family = 'sans-serif'
        return ans
    def add_frame(self, html_obj, style):
        last_run = self.framed[-1]
        if style is inherit:
            if last_run:
                self.framed.append([])
            return
        if last_run:
            if last_run[-1][1] == style:
                last_run.append((html_obj, style))
            else:
                self.framed[-1].append((html_obj, style))
        else:
            last_run.append((html_obj, style))
    def apply_frames(self):
        for run in filter(None, self.framed):
            style = run[0][1]
            paras = tuple(x[0] for x in run)
            parent = paras[0].getparent()
            idx = parent.index(paras[0])
            frame = DIV(*paras)
            parent.insert(idx, frame)
            self.framed_map[frame] = css = style.css(self.page_map[self.object_map[paras[0]]])
            self.styles.register(css, 'frame')
        if not self.block_runs:
            return
        rmap = {v:k for k, v in iteritems(self.object_map)}
        for border_style, blocks in self.block_runs:
            paras = tuple(rmap[p] for p in blocks)
            for p in paras:
                if p.tag == 'li':
                    has_li = True
                    break
            else:
                has_li = False
            parent = paras[0].getparent()
            if parent.tag in ('ul', 'ol'):
                ul = parent
                parent = ul.getparent()
                idx = parent.index(ul)
                frame = DIV(ul)
            elif has_li:
                def top_level_tag(x):
                    while True:
                        q = x.getparent()
                        if q is parent or q is None:
                            break
                        x = q
                    return x
                paras = tuple(map(top_level_tag, paras))
                idx = parent.index(paras[0])
                frame = DIV(*paras)
            else:
                idx = parent.index(paras[0])
                frame = DIV(*paras)
            parent.insert(idx, frame)
            self.framed_map[frame] = css = border_style.css
            self.styles.register(css, 'frame')
    def mark_block_runs(self, paras):
        def process_run(run):
            max_left = max_right = 0
            has_visible_border = None
            for p in run:
                style = self.styles.resolve_paragraph(p)
                if has_visible_border is None:
                    has_visible_border = style.has_visible_border()
                if isinstance(style.margin_left, numbers.Number):
                    max_left = max(style.margin_left, max_left)
                if isinstance(style.margin_right, numbers.Number):
                    max_right = max(style.margin_right, max_right)
                if has_visible_border:
                    style.margin_left = style.margin_right = inherit
                if p is not run[0]:
                    style.padding_top = 0
                else:
                    border_style = style.clone_border_styles()
                    if has_visible_border:
                        border_style.margin_top, style.margin_top = style.margin_top, inherit
                if p is not run[-1]:
                    style.padding_bottom = 0
                else:
                    if has_visible_border:
                        border_style.margin_bottom, style.margin_bottom = style.margin_bottom, inherit
                style.clear_borders()
                if p is not run[-1]:
                    style.apply_between_border()
            if has_visible_border:
                border_style.margin_left, border_style.margin_right = max_left,max_right
                self.block_runs.append((border_style, run))
        run = []
        for p in paras:
            if run and self.frame_map.get(p) == self.frame_map.get(run[-1]):
                style = self.styles.resolve_paragraph(p)
                last_style = self.styles.resolve_paragraph(run[-1])
                if style.has_identical_borders(last_style):
                    run.append(p)
                    continue
            if len(run) > 1:
                process_run(run)
            run = [p]
        if len(run) > 1:
            process_run(run)
 if __name__ == '__main__':
    import shutil
    from calibre.utils.logging import default_log
    default_log.filter_level = default_log.DEBUG
    dest_dir = os.path.join(getcwd(), 'docx_input')
    if os.path.exists(dest_dir):
        shutil.rmtree(dest_dir)
    os.mkdir(dest_dir)
    Convert(sys.argv[-1], dest_dir=dest_dir, log=default_log)()
--- a/ebook_converter/ebooks/docx/toc.py
+++ b/ebook_converter/ebooks/docx/toc.py
@@ -0,0 +1,143 @@
 #!/usr/bin/env python2
 # vim:fileencoding=utf-8
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 from collections import namedtuple
 from itertools import count
 from lxml.etree import tostring
 from calibre.ebooks.metadata.toc import TOC
 from calibre.ebooks.oeb.polish.toc import elem_to_toc_text
 from polyglot.builtins import iteritems, range
 def from_headings(body, log, namespace, num_levels=3):
    ' Create a TOC from headings in the document '
    tocroot = TOC()
    all_heading_nodes = body.xpath('//*[@data-heading-level]')
    level_prev = {i+1:None for i in range(num_levels)}
    level_prev[0] = tocroot
    level_item_map = {i:frozenset(
        x for x in all_heading_nodes if int(x.get('data-heading-level')) == i)
        for i in range(1, num_levels+1)}
    item_level_map = {e:i for i, elems in iteritems(level_item_map) for e in elems}
    idcount = count()
    def ensure_id(elem):
        ans = elem.get('id', None)
        if not ans:
            ans = 'toc_id_%d' % (next(idcount) + 1)
            elem.set('id', ans)
        return ans
    for item in all_heading_nodes:
        lvl = plvl = item_level_map.get(item, None)
        if lvl is None:
            continue
        parent = None
        while parent is None:
            plvl -= 1
            parent = level_prev[plvl]
        lvl = plvl + 1
        elem_id = ensure_id(item)
        text = elem_to_toc_text(item)
        toc = parent.add_item('index.html', elem_id, text)
        level_prev[lvl] = toc
        for i in range(lvl+1, num_levels+1):
            level_prev[i] = None
    if len(tuple(tocroot.flat())) > 1:
        log('Generating Table of Contents from headings')
        return tocroot
 def structure_toc(entries):
    indent_vals = sorted({x.indent for x in entries})
    last_found = [None for i in indent_vals]
    newtoc = TOC()
    if len(indent_vals) > 6:
        for x in entries:
            newtoc.add_item('index.html', x.anchor, x.text)
        return newtoc
    def find_parent(level):
        candidates = last_found[:level]
        for x in reversed(candidates):
            if x is not None:
                return x
        return newtoc
    for item in entries:
        level = indent_vals.index(item.indent)
        parent = find_parent(level)
        last_found[level] = parent.add_item('index.html', item.anchor,
                    item.text)
        for i in range(level+1, len(last_found)):
            last_found[i] = None
    return newtoc
 def link_to_txt(a, styles, object_map):
    if len(a) > 1:
        for child in a:
            run = object_map.get(child, None)
            if run is not None:
                rs = styles.resolve(run)
                if rs.css.get('display', None) == 'none':
                    a.remove(child)
    return tostring(a, method='text', with_tail=False, encoding='unicode').strip()
 def from_toc(docx, link_map, styles, object_map, log, namespace):
    XPath, get, ancestor = namespace.XPath, namespace.get, namespace.ancestor
    toc_level = None
    level = 0
    TI = namedtuple('TI', 'text anchor indent')
    toc = []
    for tag in XPath('//*[(@w:fldCharType and name()="w:fldChar") or name()="w:hyperlink" or name()="w:instrText"]')(docx):
        n = tag.tag.rpartition('}')[-1]
        if n == 'fldChar':
            t = get(tag, 'w:fldCharType')
            if t == 'begin':
                level += 1
            elif t == 'end':
                level -= 1
                if toc_level is not None and level < toc_level:
                    break
        elif n == 'instrText':
            if level > 0 and tag.text and tag.text.strip().startswith('TOC '):
                toc_level = level
        elif n == 'hyperlink':
            if toc_level is not None and level >= toc_level and tag in link_map:
                a = link_map[tag]
                href = a.get('href', None)
                txt = link_to_txt(a, styles, object_map)
                p = ancestor(tag, 'w:p')
                if txt and href and p is not None:
                    ps = styles.resolve_paragraph(p)
                    try:
                        ml = int(ps.margin_left[:-2])
                    except (TypeError, ValueError, AttributeError):
                        ml = 0
                    if ps.text_align in {'center', 'right'}:
                        ml = 0
                    toc.append(TI(txt, href[1:], ml))
    if toc:
        log('Found Word Table of Contents, using it to generate the Table of Contents')
        return structure_toc(toc)
 def create_toc(docx, body, link_map, styles, object_map, log, namespace):
    ans = from_toc(docx, link_map, styles, object_map, log, namespace) or from_headings(body, log, namespace)
    # Remove heading level attributes
    for h in body.xpath('//*[@data-heading-level]'):
        del h.attrib['data-heading-level']
    return ans
--- a/ebook_converter/ebooks/html/init.py
+++ b/ebook_converter/ebooks/html/init.py
@@ -0,0 +1,7 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
--- a/ebook_converter/ebooks/html/input.py
+++ b/ebook_converter/ebooks/html/input.py
@@ -0,0 +1,258 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 '''
 Input plugin for HTML or OPF ebooks.
 '''
 import os, re, sys,  errno as gerrno
 from calibre.ebooks.oeb.base import urlunquote
 from calibre.ebooks.chardet import detect_xml_encoding
 from calibre.constants import iswindows
 from calibre import unicode_path, as_unicode, replace_entities
 from polyglot.builtins import is_py3, unicode_type
 from polyglot.urllib import urlparse, urlunparse
 class Link(object):
    '''
    Represents a link in an HTML file.
    '''
    @classmethod
    def url_to_local_path(cls, url, base):
        path = url.path
        isabs = False
        if iswindows and path.startswith('/'):
            path = path[1:]
            isabs = True
        path = urlunparse(('', '', path, url.params, url.query, ''))
        path = urlunquote(path)
        if isabs or os.path.isabs(path):
            return path
        return os.path.abspath(os.path.join(base, path))
    def __init__(self, url, base):
        '''
        :param url:  The url this link points to. Must be an unquoted unicode string.
        :param base: The base directory that relative URLs are with respect to.
                     Must be a unicode string.
        '''
        assert isinstance(url, unicode_type) and isinstance(base, unicode_type)
        self.url         = url
        self.parsed_url  = urlparse(self.url)
        self.is_local    = self.parsed_url.scheme in ('', 'file')
        self.is_internal = self.is_local and not bool(self.parsed_url.path)
        self.path        = None
        self.fragment    = urlunquote(self.parsed_url.fragment)
        if self.is_local and not self.is_internal:
            self.path = self.url_to_local_path(self.parsed_url, base)
    def __hash__(self):
        if self.path is None:
            return hash(self.url)
        return hash(self.path)
    def __eq__(self, other):
        return self.path == getattr(other, 'path', other)
    def __str__(self):
        return 'Link: %s --> %s'%(self.url, self.path)
    if not is_py3:
        __unicode__ = __str__
 class IgnoreFile(Exception):
    def __init__(self, msg, errno):
        Exception.__init__(self, msg)
        self.doesnt_exist = errno == gerrno.ENOENT
        self.errno = errno
 class HTMLFile(object):
    '''
    Contains basic information about an HTML file. This
    includes a list of links to other files as well as
    the encoding of the file. Also tries to detect if the file is not an HTML
    file, in which case :member:`is_binary` is set to True.
    The encoding of the file is available as :member:`encoding`.
    '''
    HTML_PAT  = re.compile(r'<\s*html', re.IGNORECASE)
    TITLE_PAT = re.compile('<title>([^<>]+)</title>', re.IGNORECASE)
    LINK_PAT  = re.compile(
    r'<\s*a\s+.*?href\s*=\s*(?:(?:"(?P<url1>[^"]+)")|(?:\'(?P<url2>[^\']+)\')|(?P<url3>[^\s>]+))',
    re.DOTALL|re.IGNORECASE)
    def __init__(self, path_to_html_file, level, encoding, verbose, referrer=None):
        '''
        :param level: The level of this file. Should be 0 for the root file.
        :param encoding: Use `encoding` to decode HTML.
        :param referrer: The :class:`HTMLFile` that first refers to this file.
        '''
        self.path     = unicode_path(path_to_html_file, abs=True)
        self.title    = os.path.splitext(os.path.basename(self.path))[0]
        self.base     = os.path.dirname(self.path)
        self.level    = level
        self.referrer = referrer
        self.links    = []
        try:
            with open(self.path, 'rb') as f:
                src = header = f.read(4096)
                encoding = detect_xml_encoding(src)[1]
                if encoding:
                    try:
                        header = header.decode(encoding)
                    except ValueError:
                        pass
                self.is_binary = level > 0 and not bool(self.HTML_PAT.search(header))
                if not self.is_binary:
                    src += f.read()
        except IOError as err:
            msg = 'Could not read from file: %s with error: %s'%(self.path, as_unicode(err))
            if level == 0:
                raise IOError(msg)
            raise IgnoreFile(msg, err.errno)
        if not src:
            if level == 0:
                raise ValueError('The file %s is empty'%self.path)
            self.is_binary = True
        if not self.is_binary:
            if not encoding:
                encoding = detect_xml_encoding(src[:4096], verbose=verbose)[1]
                self.encoding = encoding
            else:
                self.encoding = encoding
            src = src.decode(encoding, 'replace')
            match = self.TITLE_PAT.search(src)
            self.title = match.group(1) if match is not None else self.title
            self.find_links(src)
    def __eq__(self, other):
        return self.path == getattr(other, 'path', other)
    def __hash__(self):
        return hash(self.path)
    def __str__(self):
        return 'HTMLFile:%d:%s:%s'%(self.level, 'b' if self.is_binary else 'a', self.path)
    def __repr__(self):
        return unicode_type(self)
    def find_links(self, src):
        for match in self.LINK_PAT.finditer(src):
            url = None
            for i in ('url1', 'url2', 'url3'):
                url = match.group(i)
                if url:
                    break
            url = replace_entities(url)
            try:
                link = self.resolve(url)
            except ValueError:
                # Unparseable URL, ignore
                continue
            if link not in self.links:
                self.links.append(link)
    def resolve(self, url):
        return Link(url, self.base)
 def depth_first(root, flat, visited=None):
    yield root
    if visited is None:
        visited = set()
    visited.add(root)
    for link in root.links:
        if link.path is not None and link not in visited:
            try:
                index = flat.index(link)
            except ValueError:  # Can happen if max_levels is used
                continue
            hf = flat[index]
            if hf not in visited:
                yield hf
                visited.add(hf)
                for hf in depth_first(hf, flat, visited):
                    if hf not in visited:
                        yield hf
                        visited.add(hf)
 def traverse(path_to_html_file, max_levels=sys.maxsize, verbose=0, encoding=None):
    '''
    Recursively traverse all links in the HTML file.
    :param max_levels: Maximum levels of recursion. Must be non-negative. 0
                       implies that no links in the root HTML file are followed.
    :param encoding:   Specify character encoding of HTML files. If `None` it is
                       auto-detected.
    :return:           A pair of lists (breadth_first, depth_first). Each list contains
                       :class:`HTMLFile` objects.
    '''
    assert max_levels >= 0
    level = 0
    flat =  [HTMLFile(path_to_html_file, level, encoding, verbose)]
    next_level = list(flat)
    while level < max_levels and len(next_level) > 0:
        level += 1
        nl = []
        for hf in next_level:
            rejects = []
            for link in hf.links:
                if link.path is None or link.path in flat:
                    continue
                try:
                    nf = HTMLFile(link.path, level, encoding, verbose, referrer=hf)
                    if nf.is_binary:
                        raise IgnoreFile('%s is a binary file'%nf.path, -1)
                    nl.append(nf)
                    flat.append(nf)
                except IgnoreFile as err:
                    rejects.append(link)
                    if not err.doesnt_exist or verbose > 1:
                        print(repr(err))
            for link in rejects:
                hf.links.remove(link)
        next_level = list(nl)
    orec = sys.getrecursionlimit()
    sys.setrecursionlimit(500000)
    try:
        return flat, list(depth_first(flat[0], flat))
    finally:
        sys.setrecursionlimit(orec)
 def get_filelist(htmlfile, dir, opts, log):
    '''
    Build list of files referenced by html file or try to detect and use an
    OPF file instead.
    '''
    log.info('Building file list...')
    filelist = traverse(htmlfile, max_levels=int(opts.max_levels),
                        verbose=opts.verbose,
                        encoding=opts.input_encoding)[0 if opts.breadth_first else 1]
    if opts.verbose:
        log.debug('\tFound files...')
        for f in filelist:
            log.debug('\t\t', f)
    return filelist
--- a/ebook_converter/ebooks/html/to_zip.py
+++ b/ebook_converter/ebooks/html/to_zip.py
@@ -0,0 +1,122 @@
 #!/usr/bin/env python2
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2011, Kovid Goyal <kovid@kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 import textwrap, os, glob
 from calibre.customize import FileTypePlugin
 from calibre.constants import numeric_version
 from polyglot.builtins import unicode_type
 class HTML2ZIP(FileTypePlugin):
    name = 'HTML to ZIP'
    author = 'Kovid Goyal'
    description = textwrap.dedent(_('''\
 Follow all local links in an HTML file and create a ZIP \
 file containing all linked files. This plugin is run \
 every time you add an HTML file to the library.\
 '''))
    version = numeric_version
    file_types = {'html', 'htm', 'xhtml', 'xhtm', 'shtm', 'shtml'}
    supported_platforms = ['windows', 'osx', 'linux']
    on_import = True
    def run(self, htmlfile):
        import codecs
        from calibre import prints
        from calibre.ptempfile import TemporaryDirectory
        from calibre.gui2.convert.gui_conversion import gui_convert
        from calibre.customize.conversion import OptionRecommendation
        from calibre.ebooks.epub import initialize_container
        with TemporaryDirectory('_plugin_html2zip') as tdir:
            recs =[('debug_pipeline', tdir, OptionRecommendation.HIGH)]
            recs.append(['keep_ligatures', True, OptionRecommendation.HIGH])
            if self.site_customization and self.site_customization.strip():
                sc = self.site_customization.strip()
                enc, _, bf = sc.partition('|')
                if enc:
                    try:
                        codecs.lookup(enc)
                    except Exception:
                        prints('Ignoring invalid input encoding for HTML:', enc)
                    else:
                        recs.append(['input_encoding', enc, OptionRecommendation.HIGH])
                if bf == 'bf':
                    recs.append(['breadth_first', True,
                        OptionRecommendation.HIGH])
            gui_convert(htmlfile, tdir, recs, abort_after_input_dump=True)
            of = self.temporary_file('_plugin_html2zip.zip')
            tdir = os.path.join(tdir, 'input')
            opf = glob.glob(os.path.join(tdir, '*.opf'))[0]
            ncx = glob.glob(os.path.join(tdir, '*.ncx'))
            if ncx:
                os.remove(ncx[0])
            epub = initialize_container(of.name, os.path.basename(opf))
            epub.add_dir(tdir)
            epub.close()
        return of.name
    def customization_help(self, gui=False):
        return _('Character encoding for the input HTML files. Common choices '
        'include: cp1252, cp1251, latin1 and utf-8.')
    def do_user_config(self, parent=None):
        '''
        This method shows a configuration dialog for this plugin. It returns
        True if the user clicks OK, False otherwise. The changes are
        automatically applied.
        '''
        from PyQt5.Qt import (QDialog, QDialogButtonBox, QVBoxLayout,
                QLabel, Qt, QLineEdit, QCheckBox)
        config_dialog = QDialog(parent)
        button_box = QDialogButtonBox(QDialogButtonBox.Ok | QDialogButtonBox.Cancel)
        v = QVBoxLayout(config_dialog)
        def size_dialog():
            config_dialog.resize(config_dialog.sizeHint())
        button_box.accepted.connect(config_dialog.accept)
        button_box.rejected.connect(config_dialog.reject)
        config_dialog.setWindowTitle(_('Customize') + ' ' + self.name)
        from calibre.customize.ui import (plugin_customization,
                customize_plugin)
        help_text = self.customization_help(gui=True)
        help_text = QLabel(help_text, config_dialog)
        help_text.setWordWrap(True)
        help_text.setTextInteractionFlags(Qt.LinksAccessibleByMouse | Qt.LinksAccessibleByKeyboard)
        help_text.setOpenExternalLinks(True)
        v.addWidget(help_text)
        bf = QCheckBox(_('Add linked files in breadth first order'))
        bf.setToolTip(_('Normally, when following links in HTML files'
            ' calibre does it depth first, i.e. if file A links to B and '
            'C, but B links to D, the files are added in the order A, B, D, C. '
            'With this option, they will instead be added as A, B, C, D'))
        sc = plugin_customization(self)
        if not sc:
            sc = ''
        sc = sc.strip()
        enc = sc.partition('|')[0]
        bfs = sc.partition('|')[-1]
        bf.setChecked(bfs == 'bf')
        sc = QLineEdit(enc, config_dialog)
        v.addWidget(sc)
        v.addWidget(bf)
        v.addWidget(button_box)
        size_dialog()
        config_dialog.exec_()
        if config_dialog.result() == QDialog.Accepted:
            sc = unicode_type(sc.text()).strip()
            if bf.isChecked():
                sc += '|bf'
            customize_plugin(self, sc)
        return config_dialog.result()
--- a/ebook_converter/ebooks/html_entities.py
+++ b/ebook_converter/ebooks/html_entities.py
--- a/ebook_converter/ebooks/lrf/init.py
+++ b/ebook_converter/ebooks/lrf/init.py
@@ -0,0 +1,115 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 """
 This package contains logic to read and write LRF files.
 The LRF file format is documented at U{http://www.sven.de/librie/Librie/LrfFormat}.
 """
 from calibre.ebooks.lrf.pylrs.pylrs import Book as _Book
 from calibre.ebooks.lrf.pylrs.pylrs import TextBlock, Header, \
                                             TextStyle, BlockStyle
 from calibre.ebooks.lrf.fonts import FONT_FILE_MAP
 from calibre.ebooks import ConversionError
 __docformat__ = "epytext"
 class LRFParseError(Exception):
    pass
 class PRS500_PROFILE(object):
    screen_width  = 600
    screen_height = 775
    dpi           = 166
    # Number of pixels to subtract from screen_height when calculating height of text area
    fudge         = 0
    font_size     = 10  #: Default (in pt)
    parindent     = 10  #: Default (in pt)
    line_space    = 1.2  # : Default (in pt)
    header_font_size = 6  #: In pt
    header_height    = 30  # : In px
    default_fonts    = {'sans': "Swis721 BT Roman", 'mono': "Courier10 BT Roman",
                         'serif': "Dutch801 Rm BT Roman"}
    name = 'prs500'
 def find_custom_fonts(options, logger):
    from calibre.utils.fonts.scanner import font_scanner
    fonts = {'serif' : None, 'sans' : None, 'mono' : None}
    def family(cmd):
        return cmd.split(',')[-1].strip()
    if options.serif_family:
        f = family(options.serif_family)
        fonts['serif'] = font_scanner.legacy_fonts_for_family(f)
        if not fonts['serif']:
            logger.warn('Unable to find serif family %s'%f)
    if options.sans_family:
        f = family(options.sans_family)
        fonts['sans'] = font_scanner.legacy_fonts_for_family(f)
        if not fonts['sans']:
            logger.warn('Unable to find sans family %s'%f)
    if options.mono_family:
        f = family(options.mono_family)
        fonts['mono'] = font_scanner.legacy_fonts_for_family(f)
        if not fonts['mono']:
            logger.warn('Unable to find mono family %s'%f)
    return fonts
 def Book(options, logger, font_delta=0, header=None,
         profile=PRS500_PROFILE, **settings):
    from uuid import uuid4
    ps = {}
    ps['topmargin']      = options.top_margin
    ps['evensidemargin'] = options.left_margin
    ps['oddsidemargin']  = options.left_margin
    ps['textwidth']      = profile.screen_width - (options.left_margin + options.right_margin)
    ps['textheight']     = profile.screen_height - (options.top_margin + options.bottom_margin) \
                                                 - profile.fudge
    if header:
        hdr = Header()
        hb = TextBlock(textStyle=TextStyle(align='foot',
                                           fontsize=int(profile.header_font_size*10)),
                       blockStyle=BlockStyle(blockwidth=ps['textwidth']))
        hb.append(header)
        hdr.PutObj(hb)
        ps['headheight'] = profile.header_height
        ps['headsep']    = options.header_separation
        ps['header']     = hdr
        ps['topmargin']  = 0
        ps['textheight'] = profile.screen_height - (options.bottom_margin + ps['topmargin']) \
                                                 - ps['headheight'] - ps['headsep'] - profile.fudge
    fontsize = int(10*profile.font_size+font_delta*20)
    baselineskip = fontsize + 20
    fonts = find_custom_fonts(options, logger)
    tsd = dict(fontsize=fontsize,
               parindent=int(10*profile.parindent),
               linespace=int(10*profile.line_space),
               baselineskip=baselineskip,
               wordspace=10*options.wordspace)
    if fonts['serif'] and 'normal' in fonts['serif']:
        tsd['fontfacename'] = fonts['serif']['normal'][1]
    book = _Book(textstyledefault=tsd,
                pagestyledefault=ps,
                blockstyledefault=dict(blockwidth=ps['textwidth']),
                bookid=uuid4().hex,
                **settings)
    for family in fonts.keys():
        if fonts[family]:
            for font in fonts[family].values():
                book.embed_font(*font)
                FONT_FILE_MAP[font[1]] = font[0]
    for family in ['serif', 'sans', 'mono']:
        if not fonts[family]:
            fonts[family] = {'normal' : (None, profile.default_fonts[family])}
        elif 'normal' not in fonts[family]:
            raise ConversionError('Could not find the normal version of the ' + family + ' font')
    return book, fonts
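The text-area arithmetic in `Book()` can be checked in isolation. The following is a minimal sketch, not part of the commit: `text_area` and its parameter names are hypothetical, and the margin values in the usage line are made up for illustration. Only the 600x775 screen size comes from `PRS500_PROFILE`.

```python
def text_area(screen_width, screen_height, left, right, top, bottom,
              fudge=0, header_height=0, header_sep=0, has_header=False):
    """Return (textwidth, textheight) in pixels, mirroring the sums in Book()."""
    width = screen_width - (left + right)
    if has_header:
        # With a header, Book() forces the top margin to 0 and subtracts
        # the header height and the header separation instead.
        height = screen_height - bottom - header_height - header_sep - fudge
    else:
        height = screen_height - (top + bottom) - fudge
    return width, height


# Hypothetical margins on the PRS500 screen (600 x 775 px).
print(text_area(600, 775, left=20, right=20, top=10, bottom=0))
```

Note that the `top` argument is ignored whenever a header is present, which matches the `ps['topmargin'] = 0` assignment above.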
--- a/ebook_converter/ebooks/lrf/fonts.py
+++ b/ebook_converter/ebooks/lrf/fonts.py
@@ -0,0 +1,33 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 from PIL import ImageFont
 '''
 Default fonts used in the PRS500
 '''
 LIBERATION_FONT_MAP = {
            'Swis721 BT Roman'     : 'LiberationSans-Regular',
            'Dutch801 Rm BT Roman' : 'LiberationSerif-Regular',
            'Courier10 BT Roman'   : 'LiberationMono-Regular',
            }
 FONT_FILE_MAP = {}
 def get_font(name, size, encoding='unic'):
    '''
    Get an ImageFont object by name.
    @param name: Font name; must be a key of LIBERATION_FONT_MAP or FONT_FILE_MAP,
                 otherwise None is returned.
    @param size: Font height in pixels. To convert from pts:
                 sz in pixels = (dpi/72) * size in pts
    @param encoding: Font encoding to use. E.g. 'unic', 'symbol', 'ADOB', 'ADBE', 'aprm'
    '''
    if name in LIBERATION_FONT_MAP:
        return ImageFont.truetype(P('fonts/liberation/%s.ttf' % LIBERATION_FONT_MAP[name]), size, encoding=encoding)
    elif name in FONT_FILE_MAP:
        return ImageFont.truetype(FONT_FILE_MAP[name], size, encoding=encoding)
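The point-to-pixel conversion quoted in the `get_font` docstring, `(dpi/72) * size in pts`, is easy to sanity-check on its own. This is a sketch under assumptions: the helper name `pts_to_pixels` and its signature are hypothetical, the default of 166 dpi is taken from `PRS500_PROFILE`, and rounding up is one possible choice rather than something the docstring prescribes.

```python
import math


def pts_to_pixels(size_pt, dpi=166):
    """Convert a size in points to whole pixels, rounding up."""
    # One point is 1/72 inch, so pixels = (dpi / 72) * points.
    return int(math.ceil(dpi / 72 * size_pt))


print(pts_to_pixels(10))          # the profile's default body font size
print(pts_to_pixels(72, dpi=72))  # at 72 dpi one point maps to one pixel
```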
--- a/ebook_converter/ebooks/lrf/html/__init__.py
+++ b/ebook_converter/ebooks/lrf/html/__init__.py
@@ -0,0 +1,10 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 """
 This package contains code to convert HTML ebooks to LRF ebooks.
 """
 __docformat__ = "epytext"
 __author__    = "Kovid Goyal <kovid@kovidgoyal.net>"
--- a/ebook_converter/ebooks/lrf/html/color_map.py
+++ b/ebook_converter/ebooks/lrf/html/color_map.py
@@ -0,0 +1,115 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 import re
 NAME_MAP = {
             'aliceblue': '#F0F8FF',
             'antiquewhite': '#FAEBD7',
             'aqua': '#00FFFF',
             'aquamarine': '#7FFFD4',
             'azure': '#F0FFFF',
             'beige': '#F5F5DC',
             'bisque': '#FFE4C4',
             'black': '#000000',
             'blanchedalmond': '#FFEBCD',
             'blue': '#0000FF',
             'brown': '#A52A2A',
             'burlywood': '#DEB887',
             'cadetblue': '#5F9EA0',
             'chartreuse': '#7FFF00',
             'chocolate': '#D2691E',
             'coral': '#FF7F50',
             'crimson': '#DC143C',
             'cyan': '#00FFFF',
             'darkblue': '#00008B',
             'darkgoldenrod': '#B8860B',
             'darkgreen': '#006400',
             'darkkhaki': '#BDB76B',
             'darkmagenta': '#8B008B',
             'darkolivegreen': '#556B2F',
             'darkorange': '#FF8C00',
             'darkorchid': '#9932CC',
             'darkred': '#8B0000',
             'darksalmon': '#E9967A',
             'darkslateblue': '#483D8B',
             'darkslategrey': '#2F4F4F',
             'darkviolet': '#9400D3',
             'deeppink': '#FF1493',
             'dodgerblue': '#1E90FF',
             'firebrick': '#B22222',
             'floralwhite': '#FFFAF0',
             'forestgreen': '#228B22',
             'fuchsia': '#FF00FF',
             'gainsboro': '#DCDCDC',
             'ghostwhite': '#F8F8FF',
             'gold': '#FFD700',
             'goldenrod': '#DAA520',
             'indianred': '#CD5C5C',
             'indigo': '#4B0082',
             'khaki': '#F0E68C',
             'lavenderblush': '#FFF0F5',
             'lawngreen': '#7CFC00',
             'lightblue': '#ADD8E6',
             'lightcoral': '#F08080',
             'lightgoldenrodyellow': '#FAFAD2',
             'lightgray': '#D3D3D3',
             'lightgrey': '#D3D3D3',
             'lightskyblue': '#87CEFA',
             'lightslategrey': '#778899',
             'lightsteelblue': '#B0C4DE',
             'lime': '#00FF00',
             'linen': '#FAF0E6',
             'magenta': '#FF00FF',
             'maroon': '#800000',
             'mediumaquamarine': '#66CDAA',
             'mediumblue': '#0000CD',
             'mediumorchid': '#BA55D3',
             'mediumpurple': '#9370D8',
             'mediumseagreen': '#3CB371',
             'mediumslateblue': '#7B68EE',
             'midnightblue': '#191970',
             'moccasin': '#FFE4B5',
             'navajowhite': '#FFDEAD',
             'navy': '#000080',
             'oldlace': '#FDF5E6',
             'olive': '#808000',
             'orange': '#FFA500',
             'orangered': '#FF4500',
             'orchid': '#DA70D6',
             'paleturquoise': '#AFEEEE',
             'papayawhip': '#FFEFD5',
             'peachpuff': '#FFDAB9',
             'powderblue': '#B0E0E6',
             'rosybrown': '#BC8F8F',
             'royalblue': '#4169E1',
             'saddlebrown': '#8B4513',
             'sandybrown': '#F4A460',
             'seashell': '#FFF5EE',
             'sienna': '#A0522D',
             'silver': '#C0C0C0',
             'skyblue': '#87CEEB',
             'slategrey': '#708090',
             'snow': '#FFFAFA',
             'springgreen': '#00FF7F',
             'violet': '#EE82EE',
             'yellowgreen': '#9ACD32'
            }
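 # Each channel is two hexadecimal digits; \d alone would reject a-f.
 hex_pat = re.compile(r'#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})', re.IGNORECASE)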
 rgb_pat = re.compile(r'rgb\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)', re.IGNORECASE)
 def lrs_color(html_color):
    hcol = html_color.lower()
    match = hex_pat.search(hcol)
    if match:
        return '0x00'+match.group(1)+match.group(2)+match.group(3)
    match = rgb_pat.search(hcol)
    if match:
        # Zero-pad every channel to two hex digits so that e.g. rgb(0, 5, 255) -> 0x000005ff
        return '0x00%02x%02x%02x' % tuple(int(match.group(i)) for i in (1, 2, 3))
    if hcol in NAME_MAP:
        return NAME_MAP[hcol].replace('#', '0x00')
    return '0x00000000'
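The lookup order used by `lrs_color` (hex literal, then `rgb(...)`, then named color, then a black fallback) can be sketched as a self-contained function. Everything here is illustrative: `to_lrs_color`, `HEX_PAT`, `RGB_PAT` and `NAMES` are hypothetical names, `NAMES` is a two-entry subset of the table above, and clamping channels to 255 is an added assumption rather than behaviour of the committed code.

```python
import re

HEX_PAT = re.compile(r'#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})')
RGB_PAT = re.compile(r'rgb\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)')
NAMES = {'navy': '#000080', 'olive': '#808000'}  # tiny illustrative subset


def to_lrs_color(html_color):
    """Map an HTML color string to a '0x00RRGGBB' style string."""
    col = html_color.strip().lower()
    match = HEX_PAT.search(col)
    if match:
        return '0x00' + ''.join(match.groups())
    match = RGB_PAT.search(col)
    if match:
        # Clamp to 0-255 (an assumption) and zero-pad to two hex digits.
        return '0x00' + ''.join('%02x' % min(255, int(g)) for g in match.groups())
    if col in NAMES:
        return NAMES[col].replace('#', '0x00')
    return '0x00000000'  # unknown colors fall back to black


for sample in ('#FFaa00', 'rgb(0, 128, 255)', 'navy', 'no-such-color'):
    print(sample, '->', to_lrs_color(sample))
```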
--- a/ebook_converter/ebooks/lrf/html/convert_from.py
+++ b/ebook_converter/ebooks/lrf/html/convert_from.py
--- a/ebook_converter/ebooks/lrf/html/table.py
+++ b/ebook_converter/ebooks/lrf/html/table.py
@@ -0,0 +1,386 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 __license__   = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
 import math, sys, re, numbers
 from calibre.ebooks.lrf.fonts import get_font
 from calibre.ebooks.lrf.pylrs.pylrs import TextBlock, Text, CR, Span, \
                                             CharButton, Plot, Paragraph, \
                                             LrsTextTag
 from polyglot.builtins import string_or_bytes, range, native_string_type
 def ceil(num):
    return int(math.ceil(num))
 def print_xml(elem):
    from calibre.ebooks.lrf.pylrs.pylrs import ElementWriter
    elem = elem.toElement(native_string_type('utf8'))
    ew = ElementWriter(elem, sourceEncoding=native_string_type('utf8'))
    ew.write(sys.stdout)
    print()
 def cattrs(base, extra):
    new = base.copy()
    new.update(extra)
    return new
 def tokens(tb):
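    '''
    Yield the tokens of a text block as (token, attrs) pairs. A token is one of:
    1. An integer: 1 for a CR between paragraphs, 2 for a CR inside a paragraph
       (attrs is None)
    2. A Plot object (attrs is None)
    3. A string: a block of text that has the same style
    '''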
    def process_element(x, attrs):
        if isinstance(x, CR):
            yield 2, None
        elif isinstance(x, Text):
            yield x.text, cattrs(attrs, {})
        elif isinstance(x, string_or_bytes):
            yield x, cattrs(attrs, {})
        elif isinstance(x, (CharButton, LrsTextTag)):
            if x.contents:
                if hasattr(x.contents[0], 'text'):
                    yield x.contents[0].text, cattrs(attrs, {})
                elif hasattr(x.contents[0], 'attrs'):
                    for z in process_element(x.contents[0], x.contents[0].attrs):
                        yield z
        elif isinstance(x, Plot):
            yield x, None
        elif isinstance(x, Span):
            attrs = cattrs(attrs, x.attrs)
            for y in x.contents:
                for z in process_element(y, attrs):
                    yield z
    for i in tb.contents:
        if isinstance(i, CR):
            yield 1, None
        elif isinstance(i, Paragraph):
            for j in i.contents:
                attrs = {}
                if hasattr(j, 'attrs'):
                    attrs = j.attrs
                for k in process_element(j, attrs):
                    yield k
 class Cell(object):
    def __init__(self, conv, tag, css):
        self.conv = conv
        self.tag = tag
        self.css  = css
        self.text_blocks = []
        self.pwidth = -1.
        if tag.has_attr('width') and '%' in tag['width']:
            try:
                self.pwidth = float(tag['width'].replace('%', ''))
            except ValueError:
                pass
        if 'width' in css and '%' in css['width']:
            try:
                self.pwidth = float(css['width'].replace('%', ''))
            except ValueError:
                pass
        if self.pwidth > 100:
            self.pwidth = -1
        self.rowspan = self.colspan = 1
        try:
            self.colspan = int(tag['colspan']) if tag.has_attr('colspan') else 1
            self.rowspan = int(tag['rowspan']) if tag.has_attr('rowspan') else 1
        except (ValueError, TypeError):  # Malformed colspan/rowspan: keep the default of 1
            pass
        pp = conv.current_page
        conv.book.allow_new_page = False
        conv.current_page = conv.book.create_page()
        conv.parse_tag(tag, css)
        conv.end_current_block()
        for item in conv.current_page.contents:
            if isinstance(item, TextBlock):
                self.text_blocks.append(item)
        conv.current_page = pp
        conv.book.allow_new_page = True
        if not self.text_blocks:
            tb = conv.book.create_text_block()
            tb.Paragraph(' ')
            self.text_blocks.append(tb)
        for tb in self.text_blocks:
            tb.parent = None
            tb.objId  = 0
            # Needed as we have to eventually change this BlockStyle's width and
            # height attributes. This blockstyle may be shared with other
            # elements, so doing that causes havoc.
            tb.blockStyle = conv.book.create_block_style()
            ts = conv.book.create_text_style(**tb.textStyle.attrs)
            ts.attrs['parindent'] = 0
            tb.textStyle = ts
            if ts.attrs['align'] == 'foot':
                if isinstance(tb.contents[-1], Paragraph):
                    tb.contents[-1].append(' ')
    def pts_to_pixels(self, pts):
        pts = int(pts)
        return ceil((float(self.conv.profile.dpi)/72)*(pts/10))
    def minimum_width(self):
        return max([self.minimum_tb_width(tb) for tb in self.text_blocks])
    def minimum_tb_width(self, tb):
        ts = tb.textStyle.attrs
        default_font = get_font(ts['fontfacename'], self.pts_to_pixels(ts['fontsize']))
        parindent = self.pts_to_pixels(ts['parindent'])
        mwidth = 0
        for token, attrs in tokens(tb):
            font = default_font
            if isinstance(token, numbers.Integral):  # Handle para and line breaks
                continue
            if isinstance(token, Plot):
                return self.pts_to_pixels(token.xsize)
            ff = attrs.get('fontfacename', ts['fontfacename'])
            fs = attrs.get('fontsize', ts['fontsize'])
            if (ff, fs) != (ts['fontfacename'], ts['fontsize']):
                font = get_font(ff, self.pts_to_pixels(fs))
            if not token.strip():
                continue
            word = token.split()
            word = word[0] if word else ""
            width = font.getsize(word)[0]
            if width > mwidth:
                mwidth = width
        return parindent + mwidth + 2
    def text_block_size(self, tb, maxwidth=sys.maxsize, debug=False):
        ts = tb.textStyle.attrs
        default_font = get_font(ts['fontfacename'], self.pts_to_pixels(ts['fontsize']))
        parindent = self.pts_to_pixels(ts['parindent'])
        top, bottom, left, right = 0, 0, parindent, parindent
        def add_word(width, height, left, right, top, bottom, ls, ws):
            if left + width > maxwidth:
                left = width + ws
                top += ls
                bottom = top+ls if top+ls > bottom else bottom
            else:
                left += (width + ws)
                right = left if left > right else right
                bottom = top+ls if top+ls > bottom else bottom
            return left, right, top, bottom
        for token, attrs in tokens(tb):
            if attrs is None:
                attrs = {}
            font = default_font
            ls = self.pts_to_pixels(attrs.get('baselineskip', ts['baselineskip']))+\
                 self.pts_to_pixels(attrs.get('linespace', ts['linespace']))
            ws = self.pts_to_pixels(attrs.get('wordspace', ts['wordspace']))
            if isinstance(token, numbers.Integral):  # Handle para and line breaks
                if top != bottom:  # Previous element not a line break
                    top = bottom
                else:
                    top += ls
                    bottom += ls
                left = parindent if token == 1 else 0
                continue
            if isinstance(token, Plot):
                width, height = self.pts_to_pixels(token.xsize), self.pts_to_pixels(token.ysize)
                left, right, top, bottom = add_word(width, height, left, right, top, bottom, height, ws)
                continue
            ff = attrs.get('fontfacename', ts['fontfacename'])
            fs = attrs.get('fontsize', ts['fontsize'])
            if (ff, fs) != (ts['fontfacename'], ts['fontsize']):
                font = get_font(ff, self.pts_to_pixels(fs))
            for word in token.split():
                width, height = font.getsize(word)
                left, right, top, bottom = add_word(width, height, left, right, top, bottom, ls, ws)
        return right+3+max(parindent, 10), bottom
    def text_block_preferred_width(self, tb, debug=False):
        return self.text_block_size(tb, sys.maxsize, debug=debug)[0]
    def preferred_width(self, debug=False):
        return ceil(max([self.text_block_preferred_width(i, debug=debug) for i in self.text_blocks]))
    def height(self, width):
        return sum([self.text_block_size(i, width)[1] for i in self.text_blocks])
 class Row(object):
    def __init__(self, conv, row, css, colpad):
        self.cells = []
        self.colpad = colpad
        cells = row.findAll(re.compile('td|th', re.IGNORECASE))
        self.targets = []
        for cell in cells:
            ccss = conv.tag_css(cell, css)[0]
            self.cells.append(Cell(conv, cell, ccss))
        for a in row.findAll(id=True) + row.findAll(name=True):
            name = a['name'] if a.has_attr('name') else a['id'] if a.has_attr('id') else None
            if name is not None:
                self.targets.append(name.replace('#', ''))
    def number_of_cells(self):
        '''Number of cells in this row. Respects colspan'''
        ans = 0
        for cell in self.cells:
            ans += cell.colspan
        return ans
    def height(self, widths):
        i, heights = 0, []
        for cell in self.cells:
            width = sum(widths[i:i+cell.colspan])
            heights.append(cell.height(width))
            i += cell.colspan
        if not heights:
            return 0
        return max(heights)
    def cell_from_index(self, col):
        i = -1
        cell = None
        for cell in self.cells:
            for k in range(0, cell.colspan):
                if i == col:
                    break
                i += 1
            if i == col:
                break
        return cell
    def minimum_width(self, col):
        cell = self.cell_from_index(col)
        if not cell:
            return 0
        return cell.minimum_width()
    def preferred_width(self, col):
        cell = self.cell_from_index(col)
        if not cell:
            return 0
        return 0 if cell.colspan > 1 else cell.preferred_width()
    def width_percent(self, col):
        cell = self.cell_from_index(col)
        if not cell:
            return -1
        return -1 if cell.colspan > 1 else cell.pwidth
    def cell_iterator(self):
        for c in self.cells:
            yield c
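`Row.cell_from_index` maps a column index to the cell that covers it, taking `colspan` into account. A minimal sketch of the same idea on plain integers follows. `cell_for_column` is a hypothetical helper, and it deliberately differs from the method above in one respect: for a column past the end of the row it returns `None`, whereas the loop in `cell_from_index` ends with `cell` still bound to the last cell.

```python
def cell_for_column(colspans, col):
    """Return the index of the cell covering column `col`, or None."""
    start = 0
    for idx, span in enumerate(colspans):
        # This cell occupies columns start .. start + span - 1.
        if start <= col < start + span:
            return idx
        start += span
    return None


# A row whose first cell spans two columns and whose second spans one.
spans = [2, 1]
print([cell_for_column(spans, c) for c in range(4)])
```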
 class Table(object):
    def __init__(self, conv, table, css, rowpad=10, colpad=10):
        self.rows = []
        self.conv = conv
        self.rowpad = rowpad
        self.colpad = colpad
        rows = table.findAll('tr')
        conv.in_table = True
        for row in rows:
            rcss = conv.tag_css(row, css)[0]
            self.rows.append(Row(conv, row, rcss, colpad))
        conv.in_table = False
    def number_of_columns(self):
        ans = 0  # Avoid shadowing the max() builtin
        for row in self.rows:
            ans = max(ans, row.number_of_cells())
        return ans
    def number_or_rows(self):
        return len(self.rows)
    def height(self, maxwidth):
        ''' Return row heights + self.rowpad'''
        widths = self.get_widths(maxwidth)
        return sum([row.height(widths) + self.rowpad for row in self.rows]) - self.rowpad
    def minimum_width(self, col):
        return max([row.minimum_width(col) for row in self.rows])
    def width_percent(self, col):
        return max([row.width_percent(col) for row in self.rows])
    def get_widths(self, maxwidth):
        '''
        Return widths of columns + self.colpad
        '''
        rows, cols = self.number_or_rows(), self.number_of_columns()
        widths = list(range(cols))
        for c in range(cols):
            cellwidths = [0 for i in range(rows)]
            for r in range(rows):
                try:
                    cellwidths[r] = self.rows[r].preferred_width(c)
                except IndexError:
                    continue
            widths[c] = max(cellwidths)
        min_widths = [self.minimum_width(i)+10 for i in range(cols)]
        for i in range(len(widths)):
            wp = self.width_percent(i)
            if wp >= 0:
                widths[i] = max(min_widths[i], ceil((wp/100) * (maxwidth - (cols-1)*self.colpad)))
        itercount = 0
        while sum(widths) > maxwidth-((len(widths)-1)*self.colpad) and itercount < 100:
            for i in range(cols):
                widths[i] = ceil((95/100)*widths[i]) if \
                    ceil((95/100)*widths[i]) >= min_widths[i] else widths[i]
            itercount += 1
        return [i+self.colpad for i in widths]
    def blocks(self, maxwidth, maxheight):
        rows, cols = self.number_or_rows(), self.number_of_columns()
        cellmatrix = [[None for c in range(cols)] for r in range(rows)]
        rowpos = [0 for i in range(rows)]
        for r in range(rows):
            nc = self.rows[r].cell_iterator()
            try:
                while True:
                    cell = next(nc)
                    cellmatrix[r][rowpos[r]] = cell
                    rowpos[r] += cell.colspan
                    for k in range(1, cell.rowspan):
                        try:
                            rowpos[r+k] += 1
                        except IndexError:
                            break
            except StopIteration:  # No more cells in this row
                continue
        widths = self.get_widths(maxwidth)
        heights = [row.height(widths) for row in self.rows]
        xpos = [sum(widths[:i]) for i in range(cols)]
        delta = maxwidth - sum(widths)
        if delta < 0:
            delta = 0
        for r in range(len(cellmatrix)):
            yield None, 0, heights[r], 0, self.rows[r].targets
            for c in range(len(cellmatrix[r])):
                cell = cellmatrix[r][c]
                if not cell:
                    continue
                width = sum(widths[c:c+cell.colspan])-self.colpad*cell.colspan
                sypos = 0
                for tb in cell.text_blocks:
                    tb.blockStyle = self.conv.book.create_block_style(
                                    blockwidth=width,
                                    blockheight=cell.text_block_size(tb, width)[1],
                                    blockrule='horz-fixed')
                    yield tb, xpos[c], sypos, delta, None
                    sypos += tb.blockStyle.attrs['blockheight']
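The shrinking loop at the end of `Table.get_widths` is the part most worth testing on its own: columns are scaled down by 5% per pass until they fit, but never below their minimum width, and the loop gives up after 100 passes. The sketch below restates that loop on plain lists. `fit_widths` and its parameters are hypothetical names, and the percentage-width handling of the real method is left out.

```python
import math


def fit_widths(preferred, minimum, maxwidth, colpad=10, max_iter=100):
    """Shrink column widths by 5% per pass until they fit into maxwidth."""
    widths = list(preferred)
    # Space left for content once the padding between columns is reserved.
    available = maxwidth - (len(widths) - 1) * colpad
    iterations = 0
    while sum(widths) > available and iterations < max_iter:
        for i, w in enumerate(widths):
            shrunk = int(math.ceil(0.95 * w))
            if shrunk >= minimum[i]:  # never go below the column's minimum
                widths[i] = shrunk
        iterations += 1
    return [w + colpad for w in widths]


print(fit_widths([100, 100], [50, 50], maxwidth=300))  # already fits
print(fit_widths([400, 300], [50, 50], maxwidth=500))  # needs shrinking
```

Because no column may drop below its minimum, the result can still exceed `maxwidth` when the minimum widths alone do not fit; the iteration cap only guarantees termination.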
--- a/ebook_converter/ebooks/lrf/pylrs/init.py
+++ b/ebook_converter/ebooks/lrf/pylrs/init.py
@@ -0,0 +1,7 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 """
 This package contains code to generate ebooks in the SONY LRS/F format. It was
 originally developed by Mike Higgins and has been extended and modified by Kovid
 Goyal.
 """
--- a/ebook_converter/ebooks/lrf/pylrs/elements.py
+++ b/ebook_converter/ebooks/lrf/pylrs/elements.py
@@ -0,0 +1,78 @@
 from __future__ import absolute_import, division, print_function, unicode_literals
 """ elements.py -- replacements and helpers for ElementTree """
 from polyglot.builtins import unicode_type, string_or_bytes
 class ElementWriter(object):
    def __init__(self, e, header=False, sourceEncoding="ascii",
                 spaceBeforeClose=True, outputEncodingName="UTF-16"):
        self.header = header
        self.e = e
        self.sourceEncoding=sourceEncoding
        self.spaceBeforeClose = spaceBeforeClose
        self.outputEncodingName = outputEncodingName
    def _encodeCdata(self, rawText):
        if isinstance(rawText, bytes):
            rawText = rawText.decode(self.sourceEncoding)
        text = rawText.replace("&", "&amp;")
        text = text.replace("<", "&lt;")
        text = text.replace(">", "&gt;")
        return text
    def _writeAttribute(self, f, name, value):
        f.write(' %s="' % unicode_type(name))
        if not isinstance(value, string_or_bytes):
            value = unicode_type(value)
        value = self._encodeCdata(value)
        value = value.replace('"', '&quot;')
        f.write(value)
        f.write('"')
    def _writeText(self, f, rawText):
        text = self._encodeCdata(rawText)
        f.write(text)
    def _write(self, f, e):
        f.write('<' + unicode_type(e.tag))
        # items() may return a view without .sort() on Python 3; sorted() works for both
        attributes = sorted(e.items())
        for name, value in attributes:
            self._writeAttribute(f, name, value)
        if e.text is not None or len(e) > 0:
            f.write('>')
            if e.text:
                self._writeText(f, e.text)
            for e2 in e:
                self._write(f, e2)
            f.write('</%s>' % e.tag)
        else:
            if self.spaceBeforeClose:
                f.write(' ')
            f.write('/>')
        if e.tail is not None:
            self._writeText(f, e.tail)
    def toString(self):
        class x:
            pass
        buffer = []
        x.write = buffer.append
        self.write(x)
        return ''.join(buffer)
    def write(self, f):
        if self.header:
            f.write('<?xml version="1.0" encoding="%s"?>\n' % self.outputEncodingName)
        self._write(f, self.e)
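The escaping done by `_encodeCdata` and `_writeAttribute` depends on replacing `&` before `<` and `>`. A standalone sketch of that ordering is given below; `escape_text` and `escape_attribute` are hypothetical free functions written for Python 3, not part of `ElementWriter`.

```python
def escape_text(raw, source_encoding='ascii'):
    """Escape character data for XML output."""
    if isinstance(raw, bytes):
        raw = raw.decode(source_encoding)
    # '&' has to be replaced first; otherwise the '&' of the entities
    # produced for '<' and '>' would be escaped a second time.
    return raw.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')


def escape_attribute(value):
    """Escape an attribute value, which is additionally quoted with '"'."""
    if not isinstance(value, (str, bytes)):
        value = str(value)
    return escape_text(value).replace('"', '&quot;')


print(escape_text('a < b & c'))
print(escape_attribute('say "hi"'))
```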