Translate a book written in LaTeX from Slovenian into English

Mar 10, 2022

With the author's permission, we will demonstrate how to translate the book Euclidean Plane Geometry, written by Milan Mitrović, from Slovenian into English without modifying any of the LaTeX commands.

To achieve this, we will first split the book into chunks, each roughly a page long, then translate each chunk into English, and finally stitch them back together.

from openai import OpenAI
import os
from transformers import GPT2Tokenizer

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if you didn't set as an env var>"))

# The GPT-2 tokenizer is the same as the GPT-3 tokenizer.
# We use it to estimate the number of tokens in the text; gpt-3.5-turbo uses
# a different encoding (cl100k_base), so the counts are approximate.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

with open("data/geometry_slovenian.tex", "r", encoding="utf-8") as f:
    text = f.read()
1485565
chunks = text.split('\n\n')
ntokens = []
for chunk in chunks:
    ntokens.append(len(tokenizer.encode(chunk)))
max(ntokens)
Token indices sequence length is longer than the specified maximum sequence length for this model (1327 > 1024). Running this sequence through the model will result in indexing errors
1473

It turns out that a double newline is a good separator in this case, since it does not break the flow of the text. Also, no individual chunk is larger than 1500 tokens. The code below calls gpt-3.5-turbo, whose context window is at least 4096 tokens, so we don't need to break the chunks down further.
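As a quick sanity check on this chunking strategy, the sketch below shows that splitting on a double newline and joining with the same separator reproduces the input exactly, so nothing is lost before translation. The `sample` text is a made-up miniature document for illustration, not an excerpt from the book.

```python
# Hypothetical miniature LaTeX document; not taken from the book.
sample = (
    "\\poglavje{Osnove Geometrije} \\label{osn9Geom}\n\n"
    "Naj bo $ABC$ trikotnik.\n"
    "Potem velja $|AB| < |AC| + |CB|$.\n\n"
    "\\item \\res{Nariši točko $A$.}"
)

# Split on blank lines: each paragraph or list item becomes one chunk.
pieces = sample.split("\n\n")

# Joining with the same separator restores the original text exactly.
restored = "\n\n".join(pieces)

print(len(pieces))         # number of chunks
print(restored == sample)  # True if the round trip is lossless
```

Single newlines inside a paragraph stay within one chunk, which is why a sentence is never cut in half by this separator.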

We will group the shorter chunks into batches of around 1000 tokens to make the text more coherent and to reduce the number of breaks within it.

def group_chunks(chunks, ntokens, max_len=1000, hard_max_len=3000):
    """
    Group very short chunks, to form approximately page long chunks.
    """
    batches = []
    cur_batch = ""
    cur_tokens = 0
    
    # iterate over chunks, and group the short ones together
    for chunk, ntoken in zip(chunks, ntokens):
        # discard chunks that exceed hard max length
        if ntoken > hard_max_len:
            print(f"Warning: Chunk discarded for being too long ({ntoken} tokens > {hard_max_len} token limit). Preview: '{chunk[:50]}...'")
            continue

        # if the current batch is non-empty and has room, add the new chunk
        if cur_batch and cur_tokens + 1 + ntoken <= max_len:
            cur_batch += "\n\n" + chunk
            cur_tokens += 1 + ntoken  # adds 1 token for the two newlines
        # otherwise, record the batch (if any) and start a new one
        else:
            if cur_batch:  # avoid recording an empty batch
                batches.append(cur_batch)
            cur_batch = chunk
            cur_tokens = ntoken
            
    if cur_batch:  # add the last batch if it's not empty
        batches.append(cur_batch)
        
    return batches


chunks = group_chunks(chunks, ntokens)
len(chunks)
869
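To make the grouping behaviour concrete, here is a minimal sketch of the same greedy idea applied to a handful of made-up token counts. The helper `greedy_group` and the numbers in `toy_sizes` are hypothetical and exist only for illustration; unlike the function above, this sketch keeps an oversized item as its own group instead of discarding it.

```python
def greedy_group(sizes, max_len):
    """Greedily pack consecutive sizes into groups whose total stays <= max_len.

    Hypothetical helper for illustration. Each join between two items
    costs one extra token, mirroring the separator accounting above.
    """
    groups, current, total = [], [], 0
    for size in sizes:
        cost = size if not current else size + 1  # +1 for the separator
        if current and total + cost > max_len:
            # No room left: close the current group and start a new one.
            groups.append(current)
            current, total = [size], size
        else:
            current.append(size)
            total += cost
    if current:
        groups.append(current)
    return groups


toy_sizes = [400, 300, 350, 900, 50, 40]  # made-up token counts
print(greedy_group(toy_sizes, max_len=1000))
# prints [[400, 300], [350], [900, 50, 40]]
```

Because the packing is greedy and order-preserving, a batch can end well below the limit (the lone 350 here) when the next chunk is large.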

Notice that adding a sample command in both its untranslated and translated form, where only the chapter name needs to be translated, helps to get more consistent results.

The format of the prompt sent to the model consists of:

  1. A high-level instruction to translate only the text, but not the commands, into the desired language
  2. A sample untranslated command, where only the content of the chapter name needs to be translated
  3. The chunk of text to be translated
  4. The translated sample command from 2, which shows the model the beginning of the translation process

The expected output is the translated chunk of text.
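The four parts can be seen more easily when the prompt is assembled for a tiny chunk. The following is only a sketch: `build_prompt` is a hypothetical helper that mirrors the layout described above, and the one-sentence chunk is made up rather than taken from the book.

```python
def build_prompt(chunk, dest_language, sample_source, sample_target):
    """Assemble the four prompt parts in order (hypothetical helper)."""
    return (
        # 1. high-level instruction
        f"Translate only the text from the following LaTeX document into "
        f"{dest_language}. Leave all LaTeX commands unchanged\n\n"
        # 2. untranslated sample command, then 3. the chunk, inside triple quotes
        f'"""\n{sample_source}\n{chunk}"""\n\n'
        # 4. translated sample command, which starts the answer
        f"{sample_target}\n"
    )


prompt = build_prompt(
    chunk=r"Naj bo $ABC$ trikotnik.",  # made-up chunk
    dest_language="English",
    sample_source=r"\poglavje{Osnove Geometrije} \label{osn9Geom}",
    sample_target=r"\poglavje{The basics of Geometry} \label{osn9Geom}",
)
print(prompt)
```

Ending the prompt with the already translated sample command nudges the model to continue with the translation of the chunk itself rather than with commentary.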

def translate_chunk(chunk, model='gpt-3.5-turbo',
                    dest_language='English',
                    # raw strings keep the LaTeX backslashes from being read as escape sequences
                    sample_translation=(r"\poglavje{Osnove Geometrije} \label{osn9Geom}", r"\poglavje{The basics of Geometry} \label{osn9Geom}")
                    ):
    prompt = f'''Translate only the text from the following LaTeX document into {dest_language}. Leave all LaTeX commands unchanged
    
"""
{sample_translation[0]}
{chunk}"""

{sample_translation[1]}
'''
    response = client.chat.completions.create(
        messages=[{"role": "user", "content":prompt}],
        model=model,
        temperature=0,
        top_p=1,
        max_tokens=1500,
    )
    result = response.choices[0].message.content.strip()
    result = result.replace('"""', '') # remove the triple quotes, as we used them to surround the text
    return result
print(translate_chunk(chunks[800], model='gpt-3.5-turbo', dest_language='English'))
Let $\mathcal{I}=\mathcal{S}_{AB} \circ\mathcal{S}_{CA}
    \circ\mathcal{S}_{BC}$. By  \ref{izoZrcdrsprq} is
    $\mathcal{I}$ a mirror reflection. Let $A_1$, $B_1$ and $C_1$ be in order the center points of the lines $BC$, $AC$ and $AB$ of the triangle $ABC$.
    Because it is a right triangle is $\mathcal{I}(A_1C_1)=A_1C_1$, which
    means that the line $A_1C_1$ is of this mirror reflection. It is not
    difficult to prove that for the point $A'_1=\mathcal{I}(A_1)$ (both
    lie on the axis $A_1C_1$) is
    $\overrightarrow{A_1A'_1}=3\overrightarrow{A_1C_1}$, so
    $\mathcal{I}=\mathcal{G}_{3\overrightarrow{A_1C_1}}$.

\item  \res{Given are the points $A$ and $B$ on the same side of the line
$p$.
Draw the line  $XY$, which lies on the line $p$ and is consistent
with the given line $l$, so that the sum
$|AX|+|XY|+|YB|$ is minimal.}

Let $A'=\mathcal{G}_{\overrightarrow{MN}}(A)$ (where $M,N\in
p$ and $MN\cong l$). The point $Y$ is obtained as the intersection of the lines $p$
and $X'Y$ (see also example \ref{HeronProbl}).

\item  \res{Let $ABC$ be an isosceles right triangle with a right angle at the vertex $A$. What does the composite
$\mathcal{G}_{\overrightarrow{AB}}\circ \mathcal{G}_{\overrightarrow{CA}}$ represent?}

Let $p$ and $q$ be the simetrali of the sides $CA$ and $AB$ of the triangle
$ABC$. By  \ref{izoZrcDrsKompSrOsn} is:
 $$\mathcal{G}_{\overrightarrow{AB}}\circ
 \mathcal{G}_{\overrightarrow{CA}}=
 \mathcal{S}_q\circ\mathcal{S}_A\circ\mathcal{S}_A\circ\mathcal{S}_p=
 \mathcal{S}_q\circ\mathcal{S}_p.$$ Because $ABC$ is an isosceles
 right triangle with a right angle at the vertex $A$, the lines $p$ and $q$ are perpendicular and intersect at the center $S$
 of the hypotenuse $BC$. Therefore
 $\mathcal{G}_{\overrightarrow{AB}}\circ
 \mathcal{G}_{\overrightarrow{CA}}=\mathcal{S}_q
 \circ\mathcal{S}_p=\mathcal{S}_S$.

\item \res{In the same plane are given the lines
$a$, $b$ and $c$.
Draw the points $A\in a$ and $B\in b$
so that $\mathcal{S}_c(A)=B$.}

We can see that, in this chunk, the model translated only the text and left the LaTeX commands intact.
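Checking one chunk by eye does not scale to the whole book, so an automated comparison may help. The sketch below is an assumption about how such a check could look, not part of the original workflow: it counts the LaTeX commands in a source chunk and in its translation and reports any difference. The names `latex_commands` and `command_diff` and the sample strings are hypothetical.

```python
import re
from collections import Counter

# Matches control words such as \item or \mathcal, and control symbols such as \%.
COMMAND_RE = re.compile(r"\\(?:[A-Za-z]+|.)")


def latex_commands(text):
    """Count every LaTeX command occurring in text."""
    return Counter(COMMAND_RE.findall(text))


def command_diff(source, translated):
    """Return (missing, extra) command counts of translated relative to source."""
    src, dst = latex_commands(source), latex_commands(translated)
    return src - dst, dst - src


# Made-up example strings, not taken from the book.
source = r"\item \res{Dani sta točki $A$ in $B$ ter premica $\mathcal{S}_p$.}"
good = r"\item \res{Given are the points $A$ and $B$ and the line $\mathcal{S}_p$.}"
bad = r"\item Given are the points $A$ and $B$ and the line $\mathcal{S}_p$."

print(command_diff(source, good))  # both counters are empty
print(command_diff(source, bad))   # \res is reported as missing
```

This compares command names only; it would not notice a changed argument such as a rewritten `\label` key, so it complements rather than replaces a manual review.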

Let's now translate all the chunks in the book - this will take 2-3 hours, as we're processing requests sequentially.

dest_language = "English"

translated_chunks = []
for i, chunk in enumerate(chunks):
    print(f"{i} / {len(chunks)}")  # progress counter (0-based)
    # translate each chunk
    translated_chunks.append(translate_chunk(chunk, model='gpt-3.5-turbo', dest_language=dest_language))

# join the chunks together
result = '\n\n'.join(translated_chunks)

# save the final result
with open(f"data/geometry_{dest_language}.tex", "w", encoding="utf-8") as f:
    f.write(result)
0 / 869
1 / 869
2 / 869
...
867 / 869
868 / 869
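Since most of the run time comes from waiting on sequential requests, one possible speed-up is to issue several requests at once while keeping the chunks in order. The following is a minimal sketch under assumptions: `translate_all` and `fake_translate` are hypothetical names, `max_workers=4` is an arbitrary value, and the stand-in translator replaces the real API call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_all(chunks, translate, max_workers=4):
    """Translate chunks concurrently; results keep the input order.

    `translate` is any callable taking one chunk and returning its
    translation; `max_workers` is an assumed, tunable value.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(translate, chunks))


# Stand-in for the real API call, so the sketch runs offline.
def fake_translate(chunk):
    return chunk.upper()


demo = ["prvi odstavek", "drugi odstavek", "tretji odstavek"]
print("\n\n".join(translate_all(demo, fake_translate)))
```

In a real run, the callable passed as `translate` would wrap the translation function defined earlier, and API rate limits would likely cap the useful number of workers.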