Finding quoted text that contains 5 or more words

Problem description and brief analysis

Given a text file txt.txt with the following content:

Attorney General William Barr said the volume of information compromised was "staggering" and the largest breach in U.S. history."This theft not only caused significant financial damage to Equifax but invaded the privacy of many, millions of Americans and imposed substantial costs and burdens on them as they had to take measures to protect themselves from identity theft." said Mr. Barr.

The task is to return every string enclosed in quotation marks that contains at least 5 words. The expected result is:

This theft not only caused significant financial damage to Equifax but invaded the privacy of many, millions of Americans and imposed substantial costs and burdens on them as they had to take measures to protect themselves from identity theft.

Solutions and brief explanations

Method 1: conditional grouping

Write the script p1.dfx in esProc as follows:


     A
1    =file("txt.txt").read()
2    =A1.words@w().group@i(~[-1]=="\"").select(#%2==0 && ~.count(isalpha(~))>=5).(~.m(:-2).concat())
3    =file("result.txt").export(A2)

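Brief explanation:

A1  Read the text file into a string.

A2  The words function splits out the English words in the string; the @w option splits the string into all of its characters, keeping runs of English letters or digits together as words. The resulting sequence is grouped by a condition: a new group starts whenever the previous member is a quotation mark. From these groups, select the even-numbered ones in which at least 5 members consist of letters. Each selected group, minus its last member (the closing quotation mark), is then concatenated into a string.

A3  Export the result to result.txt.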
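For readers who want to cross-check the grouping logic outside esProc, the following is a minimal Python sketch of the same idea. It is not the SPL implementation: the function name `quoted_by_grouping` and the tokenizing pattern are assumptions made for illustration, and the check for an unterminated quote is an extra guard that the SPL expression does not have.

```python
import re

def quoted_by_grouping(text, min_words=5):
    """Return quoted strings with at least min_words alphabetic tokens.

    Hypothetical helper that mirrors the conditional-grouping idea:
    tokenize, start a new group after every quotation mark, keep even groups.
    """
    # Runs of ASCII letters/digits become one token; every other character
    # (spaces, punctuation, quotes) becomes a token of its own.
    tokens = re.findall(r"[A-Za-z0-9]+|.", text, flags=re.S)

    # Close the current group each time a quotation mark is seen, so the
    # next token starts a new group.
    groups, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == '"':
            groups.append(current)
            current = []
    if current:
        groups.append(current)

    result = []
    for index, group in enumerate(groups, start=1):
        if index % 2 != 0:
            continue  # odd groups hold the text outside the quotes
        if group[-1] != '"':
            continue  # extra guard: skip an unterminated quote
        word_count = sum(1 for tok in group if tok.isalpha())
        if word_count >= min_words:
            # Drop the closing quotation mark and rebuild the string.
            result.append("".join(group[:-1]))
    return result
```

Calling `quoted_by_grouping(open("txt.txt").read())` on the sample file should yield the same long quotation as the script above, provided the file uses plain ASCII double quotes.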

Method 2: regular expression

Write the script p1.dfx in esProc as follows:


     A
1    =file("txt.txt").read()
2    =A1.regex("\"([^\"]*)\"").select(~.words().len()>=5)
3    =file("result.txt").export(A2)

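Brief explanation:

A1  Read the text file into a string.

A2  Use a regular expression to find all strings enclosed in quotation marks, then select those that contain at least 5 words.

A3  Export the result to result.txt.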
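The regular-expression approach can be sketched in Python in much the same way. This is a rough equivalent rather than a translation of the SPL call: the function name `quoted_by_regex` is hypothetical, and counting words as runs of ASCII letters is an assumption about what should count as a word (so an abbreviation such as "U.S." counts as two).

```python
import re

def quoted_by_regex(text, min_words=5):
    """Return quoted strings containing at least min_words words.

    Hypothetical helper: a word is assumed to be a run of ASCII letters.
    """
    # Capture everything between a pair of double quotes, with no quote
    # character allowed inside the captured group.
    quoted = re.findall(r'"([^"]*)"', text)
    # Keep only the captures with enough words.
    return [q for q in quoted if len(re.findall(r"[A-Za-z]+", q)) >= min_words]
```

The pattern is the same one used in A2 above, `"([^"]*)"`, so quotes are paired strictly left to right. A stray unmatched quotation mark earlier in the text would shift every later pair.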

Source question

https://stackoverflow.com/questions/60310558/regex-to-match-quote-with-minimum-number-of-words