LibreOffice plugin that pipes whole Writer documents through Google Translate and aims to keep most of the page formatting.

PageTranslate


Check-in [85d939b85a]

Overview
Comment: Add cli(), translate_python() providers, basic deepl_api(), and non-working deepl_web() implementation.
SHA1: 85d939b85a873d09b131fb2308ca81e5f24316f5
User & Date: mario 2020-05-24 19:01:38
Context
2020-05-24
19:04  Rename pagetranslate_opts to settings. Adapt registry leave, automate property name detection, safeguard against unknown control types (e.g. framebox), prepare transition to registry read/write (instead of config file). Simplify argparse() and self.params update. check-in: 23054650f3 user: mario tags: trunk
19:01  Add cli(), translate_python() providers, basic deepl_api(), and non-working deepl_web() implementation. check-in: 85d939b85a user: mario tags: trunk
18:59  Introduce more options (microsoft, mymemory, cli) check-in: 4232826ef2 user: mario tags: trunk
Changes

Changes to pythonpath/Makefile.

 # The pythonpath/ directory can be packaged alongside, to inject new python
 # packages into LibreOffice's python bundle. Which really just makes sense
 # for the Windows package, because distro-packaged Office setups utilize the
 # system python path.
 
 
 all:
 	pip install requests -t ./ --upgrade
 
+clean:
+	find . -iname '*py[co]*' -exec rm -r {} \;
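
For setups without `make` or `find` (the Windows case the Makefile comment mentions), the `clean` target could be approximated in Python. This is only a sketch under assumptions: the function name `clean_bytecode` is made up for illustration, and it is deliberately narrower than the `find` pattern, which matches any name containing `pyc` or `pyo`.

```python
import shutil
from pathlib import Path


def clean_bytecode(root="."):
    """Delete __pycache__ directories and stray *.pyc / *.pyo files below root.

    Hypothetical helper, not part of the check-in.
    """
    root = Path(root)
    # Collect the targets first, so the tree is not modified while walking it.
    targets = [
        p for p in root.rglob("*")
        if p.name == "__pycache__" or p.suffix in (".pyc", ".pyo")
    ]
    removed = []
    for path in targets:
        if not path.exists():
            # Already deleted together with its parent __pycache__ directory.
            continue
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
        removed.append(path)
    return removed
```

Calling `clean_bytecode("pythonpath")` from the extension's source directory would leave the installed packages themselves in place and only drop compiled bytecode.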

Changes to pythonpath/translationbackends.py.

Old version (before this check-in):

# encoding: utf-8
# api: pagetranslate
# type: classes
# category: language
# title: via_* translation backends
# description: Implements the alternative services (google, deepl, ...)
# version: 1.2
# state: beta
# depends: python:requests (>= 2.5)
# config: -
#
# Different online service backends and http interfaces are now coalesced here.
#


# modules
import re

import urllib
from urllib.parse import urlencode, quote, quote_plus
from httprequests import http
log = None



# translation backend/service
class google:

    # regex
    rx_gtrans = re.compile('class="t0">(.+?)</div>', re.S)
    rx_splitpara = re.compile("(.{1,1895\.}|.{1,1900}\s|.*$)", re.S)
    rx_empty = re.compile("^[\s\d,.:;§():-]+$")
    rx_letters = re.compile("\w\w+", re.UNICODE)
    rx_breakln = re.compile("\s?/\s?#\s?§\s?/\s?")

    def __init__(self, params={}):
        self.params = params  # config+argparse

    # request text translation from google
    def askgoogle(self, text, dst_lang="en", src_lang='auto'):
        # fetch translation page
        url = "http://translate.google.com/m?hl=%s&sl=%s&q=%s" % (
            dst_lang, src_lang, quote_plus(text)
        )
        html = http.get(url).content.decode("utf-8")
        # extract content from text <div>
        m = self.rx_gtrans.search(html)
        if m:
            text = m.group(1)
            text = text.replace("&#39;", "'").replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">").replace("&quot;", '"')
            #@todo: https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string
        else:
            log.warning("NO TRANSLATION RESULT EXTRACTED: " + html)
            log.debug("ORIG TEXT: " + repr(text))
        return text

    # iterate over text segments (1900 char limit)        
    def translate(self, text, lang="auto"):
        if lang == "auto":
            lang = self.params["lang"]
        #log.debug("translate %d chars" % len(text))
        if len(text) < 2:
            log.debug("skipping/len<2")
            return text
        elif self.rx_empty.match(text):
            log.debug("skipping/empty")
            return text
        elif not self.rx_letters.search(text):
            log.debug("skipping/noletters")
            return text
        elif len(text) >= 1900:
            log.debug("spliterate/1900+")
            return " ".join(self.askgoogle(segment, lang) for segment in self.rx_splitpara.findall(text))
        else:
            return self.askgoogle(text, lang)
            
    # translate w/ preserving paragraph breaks (meant for table cell content)
    def linebreakwise(self, text, lang="auto"):
        if self.params["crlf"] != "quick":
            # split on linebreaks and translate each individually
            text = "\n\n".join(self.translate(text, lang) for text in text.split("\n\n"))
        else:
            # use temporary placeholder `/#§/`
            text = self.translate(text.replace("\n\n", "/#§/"), lang)
            text = re.sub(self.rx_breakln, "\n\n", text)
        return text


class deepl_web(google):
    # < https://www2.deepl.com/jsonrpc
    # cookies: LMTBID: GUID...
    # referer: https://www.deepl.com/translator
    # body: {"jsonrpc":"2.0","method": "LMT_handle_jobs","params":{"jobs":[{"kind":"default","raw_en_sentence":"...","raw_en_context_before":[],"raw_en_context_after":[],"preferred_num_beams":4,"quality":"fast"}],"lang":{"user_preferred_langs":["DE","EN"],"source_lang_user_selected":"auto","target_lang":"DE"},"priority":-1,"commonJobParams":{},"timestamp":1590258680854},"id":700000000}
    # > result.translations[0].beams[0].postprocessed_sentence
    pass


class deepl_api(deepl_web):
    pass


# requires `pip install translate`
class translate_python:

    def __init__(self, params={}):
        self.params = params  # config+argparse
        self.error = pagetranslate.MessageBox


        try:
            from translate import Translator
        except:
            self.error("Use `pip install translate` to use this module.")

        self.translate = Translator(provider="microsoft", to_lang=params["lang"], secret_access=params["api_key"])

        self.linebreakwise = self.translate


# maps a t. object for config dict {"goog":1, "deepl":0}
def assign_service(params):
    if params.get("deepl_web"):
        return deepl_web(self.params)
    elif params.get("deepl_api"):
        return deepl_api(params)
    elif params.get("translate_python"):
        return translate_python(params)


    else:
        return google(params)

New version (after this check-in):

# encoding: utf-8
# api: pagetranslate
# type: classes
# category: language
# title: via_* translation backends
# description: Implements the alternative services (google, deepl, ...)
# version: 1.3
# state: beta
# depends: python:requests (>= 2.5)
# config: -
#
# Different online service backends and http interfaces are now coalesced here.
#


# modules
import re, json, time
import os, subprocess, shlex
import urllib
from urllib.parse import urlencode, quote, quote_plus
from httprequests import http
log = None






# regex
rx_gtrans = re.compile(r'class="t0">(.+?)</div>', re.S)
# split at a sentence end (up to 1895 chars plus period), else at whitespace, else take the rest
rx_splitpara = re.compile(r"(.{1,1895}\.|.{1,1900}\s|.*$)", re.S)
rx_empty = re.compile(r"^[\s\d,.:;§():-]+$")
rx_letters = re.compile(r"\w\w+", re.UNICODE)
rx_breakln = re.compile(r"\s?/\s?#\s?§\s?/\s?")


# Google Translate (default backend)
#
#  · calls mobile page http://translate.google.com/m?hl=en&sl=auto&q=TRANSLATE
#  · splits longer texts into segments of at most 1900 characters
#
class google:


    def __init__(self, params={}):
        self.params = params  # config+argparse

    # request text translation from google
    def askgoogle(self, text, dst_lang="en", src_lang='auto'):
        # fetch translation page
        url = "http://translate.google.com/m?hl=%s&sl=%s&q=%s" % (
            dst_lang, src_lang, quote_plus(text)
        )
        html = http.get(url).content.decode("utf-8")
        # extract content from text <div>
        m = rx_gtrans.search(html)
        if m:
            text = m.group(1)
            text = text.replace("&#39;", "'").replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">").replace("&quot;", '"')
            #@todo: https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string
        else:
            log.warning("NO TRANSLATION RESULT EXTRACTED: " + html)
            log.debug("ORIG TEXT: " + repr(text))
        return text

    # iterate over text segments (1900 char limit)        
    def translate(self, text, lang="auto"):
        if lang == "auto":
            lang = self.params["lang"]
        #log.debug("translate %d chars" % len(text))
        if len(text) < 2:
            log.debug("skipping/len<2")
            return text
        elif rx_empty.match(text):
            log.debug("skipping/empty")
            return text
        elif not rx_letters.search(text):
            log.debug("skipping/noletters")
            return text
        elif len(text) >= 1900:
            log.debug("spliterate/1900+")
            return " ".join(self.askgoogle(segment, lang) for segment in rx_splitpara.findall(text))
        else:
            return self.askgoogle(text, lang)
            
    # translate w/ preserving paragraph breaks (meant for table cell content)
    def linebreakwise(self, text, lang="auto"):
        if self.params["crlf"] != "quick":
            # split on linebreaks and translate each individually
            text = "\n\n".join(self.translate(text, lang) for text in text.split("\n\n"))
        else:
            # use temporary placeholder `/#§/`
            text = self.translate(text.replace("\n\n", "/#§/"))
            text = re.sub(rx_breakln, "\n\n", text)
        return text


# DeepL online translator uses some kind of json-rpc
#
#  · haven't quite extracted all necessary bits (origin of id unclear)
#  · will yield HTTP 429 Too many requests,
#    so probably not useful for multi-paragraph translation anyway
#
#
class deepl_web(google):
    # < https://www2.deepl.com/jsonrpc
    # cookies: LMTBID: GUID...
    # referer: https://www.deepl.com/translator
    # body: JSON-RPC payload, built by rpc() below
    # > result.translations[0].beams[0].postprocessed_sentence
    
    def __init__(self, params):
        self.params = params
        self.id = 702005000
        self.lang = params["lang"].upper()
        r = http.get("https://www.deepl.com/translator")  # should fetch us the cookie / No, it doesn't
        
    def rpc(self, text):
        return json.dumps({
           "jsonrpc" : "2.0",
           "method" : "LMT_handle_jobs",
           "id" : self.id,
           "params" : {
              "lang" : {
                 "target_lang" : self.lang,
                 "user_preferred_langs" : [
                    self.lang,
                    "EN"
                 ],
                 "source_lang_user_selected" : "auto"
              },
              "timestamp" : int(time.time()*1000),
              "priority" : -1,
              "commonJobParams" : {},
              "jobs" : [
                 {
                    "raw_en_context_after" : [],
                    "raw_en_context_before" : [],
                    "kind" : "default",
                    "preferred_num_beams" : 4,
                    "raw_en_sentence" : text,
                    "quality" : "fast"
                 }
              ]
           }
        })
    
    def translate(self, text):
        # skip empty paragraph/table snippets
        if len(text) < 2 or rx_empty.match(text) or not rx_letters.search(text):
            return text
    
        # request
        r = http.post(
            "https://www2.deepl.com/jsonrpc",
            data=self.rpc(text),
            headers={"Referer": "https://www.deepl.com/translator", "Content-Type": "text/plain"}
        )
        if r.status_code != 200:
            log.error(repr(r.content))
            return text
            #return r, r.content
        
        # decode
        r = r.json()
        if r.get("id"):
            self.id = r["id"] + 1
        if r.get("result"):
            return r["result"]["translations"][0]["beams"][0]["postprocessed_sentence"]
        else:
            return text


# DeepL API costs money
#
# Not sure if anyone will use this really. Unless the _web version allows testing,
# nobody's gonna shell out money for a subscription - even if it surpassed GoogleT.
# Likely makes sense for commercial users however. And the API is quite simple, so
# that's why it's here.
#
# ENTIRELY UNTESTED
#    
class deepl_api(deepl_web):

    def __init__(self, params):
        self.params = params
        
    def translate(self, text, preserve=0):
        # skip empty paragraph/table snippets
        if len(text) < 2 or rx_empty.match(text) or not rx_letters.search(text):
            return text

        # https://www.deepl.com/docs-api/translating-text/request/
        r = http.get(
            "https://api.deepl.com/v2/translate", params={
                "auth_key": self.params["api_key"],
                "text": text,
                "target_lang": self.params["lang"],
                "split_sentences": "1",
                "preserve_formatting": str(preserve)
                #"tag_handling": "xml"
            }
        )
        if r.status_code == 200:
            r = r.json().get("translations")
            if r:
                return r[0]["text"]
        else:
            log.error(r.text)
        return text
    
    def linebreakwise(self, text):
        return self.translate(text, preserve=1)


# Translate-python
# requires `pip install translate`
#
#  · provides "microsoft" backend (requires OAuth secret in api_key)
#
#  · or "mymemory" (with email in `api_key` instead)
#
# https://translate-python.readthedocs.io/en/latest/
#
class translate_python(google):

    def __init__(self, params={}):
        self.params = params  # config+argparse
        #self.error = pagetranslate.MessageBox

        Translator = None
        try:
            from translate import Translator
        except ImportError:
            raise Exception("Run `pip install translate` to use this module.")
            
        # interestingly this backend function might just work as is.
        # (a Translator instance exposes .translate(text), so bind that method)
        if params.get("mymemory"):
            self.translate = Translator(
                provider="mymemory", to_lang=params["lang"], email=params["api_key"]
            ).translate
        else:
            self.translate = Translator(
                provider="microsoft", to_lang=params["lang"], secret_access_key=params["api_key"]
            ).translate

        # though .linebreakwise has no equivalent, not sure if necessary,
        # or if formatting/linebreaks are preserved anyway
        # (or: we might just use the default google. implementation)
        self.linebreakwise = self.translate

    translate = None
    #linebreakwise = None


# Because, why not?
# Invokes a commandline tool for translating texts.
#
# → with e.g. `translate-cli -t {}` in "api_key"
#
class cli(google):

    def __init__(self, params):
        self.params = params
        self.cmd = params["api_key"]

    # pipe text through external program
    def translate(self, text):
        if rx_empty.match(text) or not rx_letters.search(text):
            return text
        cmd = [s.format(text) for s in shlex.split(self.cmd)]
        proc = subprocess.run(cmd, stdout=subprocess.PIPE)
        return proc.stdout.decode("utf-8")



# maps a pagetranslate.t.* object (in main module),
# according to config dict {"goog":1, "deepl":0}
def assign_service(params):
    if params.get("deepl_web"):
        return deepl_web(params)
    elif params.get("deepl_api"):
        return deepl_api(params)
    elif params.get("translate_python") or params.get("microsoft") or params.get("mymemory"):
        return translate_python(params)
    elif params.get("cli"):
        return cli(params)
    else:
        return google(params)
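
The `cli` backend above takes a command template from `api_key`, tokenizes it with `shlex.split()`, and substitutes the text for `{}` in each token before running it. The following is a minimal standalone sketch of that idea, not the check-in's code: the helper names `build_command` and `run_cli` are hypothetical, and it uses `str.replace("{}", ...)` rather than `str.format()`, on the assumption that templates or texts may contain other braces.

```python
import shlex
import subprocess
import sys


def build_command(template, text):
    """Tokenize a shell-like template and put `text` wherever `{}` appears.

    Substituting after tokenizing keeps the text inside a single argument,
    so spaces or quotes in it cannot split or extend the command line.
    """
    return [token.replace("{}", text) for token in shlex.split(template)]


def run_cli(template, text):
    """Run the external tool and return its stdout as a stripped string."""
    proc = subprocess.run(
        build_command(template, text),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return proc.stdout.decode("utf-8").strip()


if __name__ == "__main__":
    # Stand-in "translator": the Python interpreter upper-casing its argument.
    demo = shlex.quote(sys.executable) + ' -c "import sys; print(sys.argv[1].upper())" {}'
    print(run_cli(demo, "hello world"))
```

Because the command is passed as an argument list and no shell is involved, the translated text is never interpreted as shell syntax. Stripping the output drops the trailing newline most command-line tools append.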