
swan
1,188 views • 5 years ago

Combining Lines from Files in the File System That Meet a Condition

Desktop Processing

【Question】

I have a couple directories that include tens of thousands of html, txt, csv and other txt based files in them.
I want to output a particular line number from each one as a result file.
I need to do this as efficiently/quickly/easily as possible on a Windows 7 machine (sorry!).
HOW???

I am currently using TextPad and Notepad++ and their "find in files" to search for a particular string that is in every file on the same line number, but I think there should be a tool out there that can give me the same result more efficiently/quickly by simply going straight for the same line number of the files (200k) in the subdirectories.

I am trying to extract all the line#123 from each of the 200k files and put those lines into a new text file.
I don't need to replace or edit...

There are multiple folders with 10k to 200k files in each. That is one of the things I am avoiding... opening those folders and subfolders as even with 16gb ddr3 on dual quadcore is too slow/error prone and resource intensive

【Answer】

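The approach to this problem is straightforward: recursively get the list of files in the multi-level directory tree, open each file in turn and read the data, and append the results once per directory. The command line struggles with procedures this involved and is a poor fit for such an algorithm. A high-level language can implement it, but the code is harder to write, and possible large files make it harder still. SPL supports cursor-based reading of large files and recursive script calls, which makes this kind of algorithm easy to implement. The code is as follows: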

	A
1	=directory@p(path)
2	=A1.(file(~).cursor@s())
3	=A2.((~.skip(122),~.fetch@x(1)))
4	=A3.union()
5	=file("d:\\result.txt").export@a(A4)
6	=directory@dp(path)
7	if A6.len()==0
8		return
9	else
10		=A6.(call("c:\\readfile.dfx",~))

A1: Gets the list of files in the current directory. path is a parameter whose initial value is the root directory.

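A2: Opens each file in A1 in turn as a cursor, which does not load the file into memory. A1.(…) computes an expression on each member of A1 in turn, ~ stands for the current member, and the function file creates a file object.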

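A3: For each file cursor in A2, skips the first 122 lines and reads line 123. A2.(…) evaluates an expression on each cursor in A2 in turn. (~.skip(122),~.fetch@x(1)) evaluates the expressions inside the parentheses in order and returns the result of the last one. Here ~.skip(122) skips the first 122 lines, and ~.fetch@x(1) reads one line from the current position (that is, line 123) and closes the cursor; the @x option closes the cursor automatically once the data has been fetched. The result of ~.fetch@x(1) is what the parenthesized expression returns.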
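For readers without SPL, the same skip-then-read idea can be sketched in Python (a swapped-in language, not part of the original script). The helper name read_line and the UTF-8 default are assumptions made for illustration. islice steps past the leading lines lazily, so the file is never loaded whole.

```python
from itertools import islice


def read_line(path, line_no, encoding="utf-8"):
    """Return line `line_no` (1-based) of `path`, or None if the file is shorter."""
    # errors="replace" is an assumption: it keeps odd bytes from aborting a long run.
    with open(path, "r", encoding=encoding, errors="replace") as fh:
        # islice consumes the first line_no - 1 lines lazily, then yields one line.
        line = next(islice(fh, line_no - 1, line_no), None)
    # The with-block has already closed the file, much like fetch@x closes the cursor.
    return None if line is None else line.rstrip("\r\n")
```

Calling read_line(some_path, 123) would correspond to skipping 122 lines and fetching one.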

A4: Merges the results computed in A3.

A5: Appends A4 to the result file result.txt.
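As a rough Python counterpart to steps A1-A5 (not the SPL script itself), the sketch below scans the files directly inside one directory, pulls the requested line from each, and appends those lines to a result file. The function name append_lines_from_dir, the UTF-8 default, and the choice to skip files that are too short are all assumptions.

```python
import os
from itertools import islice


def append_lines_from_dir(directory, line_no, result_path, encoding="utf-8"):
    """Append line `line_no` of every file directly inside `directory` to `result_path`.

    Returns the number of lines written.
    """
    count = 0
    # Mode "a" mirrors the append step. Keep the result file outside the
    # scanned tree so it is never read back as input.
    with open(result_path, "a", encoding=encoding) as out:
        with os.scandir(directory) as entries:
            for entry in entries:
                if not entry.is_file():
                    continue  # subdirectories are not handled here
                with open(entry.path, "r", encoding=encoding, errors="replace") as fh:
                    line = next(islice(fh, line_no - 1, line_no), None)
                if line is not None:  # assumption: skip files with too few lines
                    out.write(line.rstrip("\r\n") + "\n")
                    count += 1
    return count
```

os.scandir yields entries one at a time, so a folder with 200k files does not need to be listed in full before processing starts.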

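A1-A5 above complete the extraction for the files in the current directory. All that remains is to get the subdirectories of the current directory and call this script recursively.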

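A6: Gets the list of subdirectories of the current directory. The function directory can return all subdirectories of the current directory; the d option returns directory names and the p option returns full paths.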

A7-B8: Returns immediately if there are no subdirectories.

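A9-B10: Calls this program recursively. Each member of A6 (each subdirectory) is processed in turn: the SPL script c:\\readfile.dfx is called with the current member (the subdirectory) as its input parameter. Note that readfile.dfx is the file name of this very script.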

Through recursive calls, SPL can extract in batch from the multi-level directories under path.
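The recursion pattern of A6-B10 can also be illustrated in Python; this is a sketch under assumptions, not a translation of the script. visit_tree and handle_dir are hypothetical names: handle_dir stands for whatever per-directory work is needed, and the function calls itself once per subdirectory, just as the script calls its own file.

```python
import os


def visit_tree(path, handle_dir):
    """Call handle_dir(path), then recurse into every subdirectory of `path`."""
    handle_dir(path)  # process the current directory first
    with os.scandir(path) as entries:
        # Assumption: symbolic links to directories are not followed, to avoid loops.
        subdirs = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    for sub in subdirs:  # an empty list ends the recursion, like the "return" branch
        visit_tree(sub, handle_dir)
```

For example, `visit_tree(r"D:\data", print)` would print every directory under the hypothetical root D:\data. Python's standard os.walk offers the same traversal without explicit recursion.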
