Passing strings between Java and a Python process launched from Java

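A Java program and a Python process that it launches exchange strings through standard input and output.
Perhaps because Python handles text internally as Unicode,
the strings have to be passed in Unicode-escaped form (\uXXXX); otherwise neither side receives them correctly.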

Preparing the Python script
The standard I/O handling is gathered into methods.

script/util/stdio.py

# -*- coding: UTF-8 -*-

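class StdIO:
    def input(self, escape=True):
        # Read lines from standard input until an empty line or EOF.
        inlist = []
        try:
            if escape:
                while True:
                    inp = input('')
                    if inp == '': break
                    # Turn \uXXXX escape sequences into the characters they stand for.
                    inlist.append(inp.encode().decode('unicode-escape'))
            else:
                while True:
                    inp = input('')
                    if inp == '': break
                    inlist.append(inp)
        except EOFError:
            # End of standard input: return what has been read so far.
            pass
        return inlist

    def printList(self, list, escape=True):
        # Print the list on standard output as a single line.
        if escape:
            outlist = []
            for e in list:
                # str -> ASCII-only bytes; non-ASCII characters become \uXXXX.
                data = e.encode('unicode-escape')
                outlist.append(data)
            # Printing a list of bytes shows each element in its b'...' form.
            print(outlist)
        else:
            print(list)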

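The input method behaves as follows:

escape	Processing
True	Converts each input line from Unicode-escaped form into a Unicode string and adds it to the list
False	Adds each input line to the list without conversion

printList prints a list; its escape parameter has the corresponding meaning (True: each element is Unicode-escaped before printing).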
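
The two conversions can be tried on their own. The following is a minimal sketch, not part of StdIO; the variable names are only for illustration. It also shows why the input should consist of ASCII escapes only: the unicode-escape codec reads every byte that is not part of an escape as Latin-1, so raw UTF-8 text comes out garbled.

```python
text = "あいうえ1234"

# printList(escape=True), per element: str -> ASCII-only bytes.
escaped = text.encode('unicode-escape')
print(escaped)

# input(escape=True), per line: a str that contains \uXXXX escapes -> str.
line = "\\u3042\\u3044\\u3046\\u30481234"   # the characters as they arrive on stdin
restored = line.encode().decode('unicode-escape')
print(restored == text)

# Raw (unescaped) non-ASCII input does not survive the same conversion.
garbled = "あ".encode().decode('unicode-escape')
print(garbled == "あ")
```

This is why the Java side escapes every string before writing it to the process.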

Python script launched as a process from Java
script/sample.py

# -*- coding: UTF-8 -*-

from util.stdio import StdIO

stdio = StdIO()
list = stdio.input()
stdio.printList(list)

This is a simple echo: it receives the strings from Java on standard input, puts them in a list,
and writes the list to standard output.
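
What the echo returns depends on where the input loop stops. The sketch below copies the loop shape of StdIO.input(escape=False) into a hypothetical read_lines() function and feeds it from an in-memory stream instead of a real process (read_from() is likewise only a helper for this illustration). It suggests that reading stops at end of input or at the first empty line, so an empty string among the values sent would cut the list short.

```python
import io
import sys

def read_lines():
    """Same loop shape as StdIO.input(escape=False): stop at an empty line or EOF."""
    lines = []
    try:
        while True:
            line = input('')
            if line == '':
                break
            lines.append(line)
    except EOFError:
        pass
    return lines

def read_from(text):
    """Feed text to read_lines() as if it were the process's standard input."""
    saved = sys.stdin
    sys.stdin = io.StringIO(text)
    try:
        return read_lines()
    finally:
        sys.stdin = saved

eof_case = read_from("A\nB\n")        # the stream ends: EOFError ends the loop
blank_case = read_from("A\n\nB\n")    # an empty line: the rest is never read
print(eof_case, blank_case)
```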

Calling program on the Java side
To launch the process, use
https://github.com/yipuran/yipuran-core/wiki/Script_exec
Also use a throwable Consumer:
https://github.com/yipuran/yipuran-core/blob/master/src/main/java/org/yipuran/function/ThrowableConsumer.java
The strings received from Python arrive wrapped in b'',
so parsing them needs special handling.

// List of strings to pass to Python
List<String> list = Arrays.asList("A", "B'B", "C\"", "'\"", "あいうえ1234");

try(ByteArrayOutputStream out = new ByteArrayOutputStream()){
   // Run script/sample.py, passing each string Unicode-encoded.
   // Python's standard output is received with IOTransfer.exec(i, out).
   ScriptExecutor.runStream(()->"python script/sample.py"
   , ()->list.stream().map(e->Unicodes.encode(e) + "\n").collect(Collectors.toList())
   , ThrowableConsumer.of(i->IOTransfer.exec(i, out))
   , (t, e)->{
      System.out.println("stderr : " + t );
      e.printStackTrace();
   });
   String response = out.toString();

   System.out.println(response);
   System.out.println("===============");

   // Parse
   Unicodes unicodes = Unicodes.of();
   Pattern pattern = Pattern.compile(
      "(b'(\\\\'|[ !#$%&\\(\\)\\*\\+,\\-\\./:;<=>?@\\[\\]^_`\\{\\|\\}\\w\\\\\"])+')"
    + "|(b\"[ !#$%&\\(\\)\\*\\+,\\-\\./:;<=>?@\\[\\]^_`\\{\\|\\}\\w\\\\']+\")|b''");
   pattern.matcher(response).results()
   .map(r->r.group())
   .map(e->e.substring(2, e.length()-1))
   .map(e->unicodes.parse(e))
   .map(e->e.replaceAll("\\\\'", "'"))
   .forEach(e->{
      System.out.println(   e );
   });

   System.out.println("===============");

}catch(IOException ex){
   ex.printStackTrace();
}finally{
}

↑
IOTransfer.exec(i, out) is the usual copy from an InputStream to an OutputStream.
I did not want to write the same kind of code over and over,
so I made it a reusable method.
↓↓↓ (it does not need to be an interface, but for now...)

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
/**
 * IOTransfer
 */
public interface IOTransfer{
   public static void exec(InputStream in, OutputStream out) throws IOException{
      byte[] b = new byte[1024];
      int len;
      while((len=in.read(b, 0, b.length)) >= 0){
         out.write(b, 0, len);
      }
      out.flush();
      out.close();
   }
}

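As the standard output of this run shows,
the strings Python writes to standard output are wrapped in b' '. ASCII characters appear as they are
and non-ASCII characters appear as \uXXXX, except that each '\' is output as the two characters '\\'.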

[b'A', b"B'B", b'C"', b'\'"', b'\\u3042\\u3044\\u3046\\u30481234']

===============
A
B'B
C"
'"
あいうえ1234
===============
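
The quoting in this output is not specific to the script: print() on a list shows the repr() of every element, and for bytes that repr chooses its quote character from the content. A minimal sketch that rebuilds the same line without Java (the names samples, reprs and line are only for illustration):

```python
samples = ["A", "B'B", 'C"', "'\"", "あいうえ1234"]

# repr() of each escaped element, i.e. what print() shows inside the list.
reprs = [repr(s.encode('unicode-escape')) for s in samples]
for r in reprs:
    print(r)

# The whole list, as sample.py would print it for these inputs.
line = str([s.encode('unicode-escape') for s in samples])
print(line)
```

The backslash of each \uXXXX is doubled because repr() escapes the backslash itself, which is the reason the Java side has to undo one more level of escaping.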

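When the string contains a single quote and no double quote, as in b"B'B",
it is wrapped in double quotes as b" ";
when it contains both a single quote and a double quote,
it is wrapped in single quotes and the single quote is escaped. Capturing this with a regular expression takes some effort,
and a long regular expression like the one below is needed.
A regular expression meaning "\p{ASCII}, excluding the single quote and the double quote"
would be neater, but here the following is processed with java.util.regex.Pattern.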

(b'(\\'|[ !#$%&\(\)\*\+,\-\./:;<=>?@\[\]^_`\{\|\}\w\\"])+')|(b\"[ !#$%&\(\)\*\+,\-\./:;<=>?@\[\]^_`\{\|\}\w\\\']+\")|b''

The regular expression alone is not enough:
the escape on the single quote is left in the captured text, so

 .map(e->e.replaceAll("\\\\'", "'"))

was also needed.
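
For comparison, the same capture and unescape steps can be sketched with a shorter pattern. This is not the regular expression used above but a hypothetical alternative, written in Python so it can be tried directly: each literal is b'...' or b"...", and its body is either a backslash followed by any character or any character other than the closing quote and the backslash. The constructs used (alternation, non-capturing groups, negated character classes) are ones java.util.regex.Pattern also accepts. The names SINGLE, DOUBLE, BYTES_LITERAL and decode_token are assumptions made for this sketch.

```python
import re

SINGLE = r"b'(?:\\.|[^'\\])*'"    # b'...' with backslash escapes inside
DOUBLE = r'b"(?:\\.|[^"\\])*"'    # b"..." with backslash escapes inside
BYTES_LITERAL = re.compile(SINGLE + "|" + DOUBLE)

def decode_token(token):
    body = token[2:-1]                          # drop the leading b and both quotes
    body = re.sub(r"\\(.)", r"\1", body)        # undo repr(): \' -> ' and \\ -> \
    return body.encode("ascii").decode("unicode-escape")   # \uXXXX -> character

# The line printed by sample.py in the run shown above.
response = r"""[b'A', b"B'B", b'C"', b'\'"', b'\\u3042\\u3044\\u3046\\u30481234']"""

tokens = BYTES_LITERAL.findall(response)
decoded = [decode_token(t) for t in tokens]
print(len(tokens), decoded == ["A", "B'B", 'C"', "'\"", "あいうえ1234"])
```

Undoing the repr() escapes in one generic step covers both the escaped single quote and the doubled backslash, so no separate replaceAll for the quote is needed in this variant.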

After capturing
b'\\u3042\\u3044\\u3046\\u30481234'
the text is a mix of ASCII and Unicode escapes, so the parse method of Unicodes from my earlier post
is used to convert it back to ordinary text.

http://oboe2uran.hatenablog.com/entry/2019/07/02/215601

Being able to launch Python from Java, pass it arbitrary strings,
and receive and interpret arbitrary strings from its result
matters when building a Java → Python → Java flow.

With Java 9 or later there is no need to prepare something like IOTransfer...
oboe2uran.hatenablog.com

Later...
I posted the following, which I think is the better approach.
↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓

oboe2uran.hatenablog.com