How to Identify Files with the Same Content on Linux

2023-08-22 11:55:04

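Introduction

Copies of files can waste a great deal of disk space and cause confusion when you want to update them. Here are six commands for identifying such files.

In a recent post, we looked at how to identify and locate files that are hard-linked (that is, files that point to the same disk content and share an inode). In this post, we'll look at commands for finding files that have the same content but are not linked.

Hard links are useful because they let a file appear in multiple places in the file system without taking up any additional disk space. Copies of files, on the other hand, can waste a great deal of disk space and risk causing confusion when you want to make updates. In this post, we'll look at several ways to identify these files.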

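Comparing files with the diff command

Probably the simplest way to compare two files is to use the diff command. The output shows the differences between the files. The < and > signs indicate whether the extra lines are in the first (<) or the second (>) file passed as an argument. In this example, the extra lines are in backup.html.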

$ diff index.html backup.html
2438a2439,2441
> <pre>
> That's all there is to report.
> </pre>

If diff produces no output, the two files are identical.

$ diff home.html index.html
$

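The only drawback of diff is that it can compare just two files at a time, and you have to name the files to compare. Some of the commands in this post can find multiple duplicate files for you.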
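One way around the two-files-at-a-time limit is to loop over every pair yourself. The following is a minimal sketch, not a polished tool: same_pairs is a hypothetical helper name, and it uses cmp -s (which only reports through its exit status whether two files are byte-for-byte identical) rather than diff. It looks only at regular files directly inside one directory.

```shell
#!/bin/sh
# same_pairs DIR
# Hypothetical helper: print each pair of regular files in DIR whose
# contents are byte-for-byte identical. Every pair is compared once.
same_pairs() {
    dir=${1:-.}
    # Load the directory entries into the positional parameters.
    set -- "$dir"/*
    while [ "$#" -gt 1 ]; do
        a=$1
        shift
        [ -f "$a" ] || continue
        # Compare $a only against the entries that come after it.
        for b in "$@"; do
            [ -f "$b" ] || continue
            # cmp -s prints nothing; exit status 0 means identical contents.
            if cmp -s "$a" "$b"; then
                printf '%s = %s\n' "$a" "$b"
            fi
        done
    done
}
```

Calling same_pairs . in a directory holding the three HTML files from the examples would report home.html and index.html as a pair. The number of comparisons grows with the square of the number of files, so this is only practical for small directories.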

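Using checksums

The cksum (checksum) command computes a checksum for each file. A checksum is a mathematical reduction of a file's contents to a number. cksum prints that number (a 32-bit CRC such as 2819078353) followed by the file's size in bytes (such as 228029). Checksums are not guaranteed to be unique, but the chance that two files with different content share both the same checksum and the same size is very small.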

$ cksum *.html
2819078353 228029 backup.html
4073570409 227985 home.html
4073570409 227985 index.html

In the example above, the second and third files produce the same checksum and the same size, so they can be assumed to be identical.
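Spotting matching checksums by eye stops working once the list gets long. As a rough sketch, the cksum output can be sorted and filtered so that only files sharing both a checksum and a size are printed. The helper name cksum_dups is hypothetical, and the sketch assumes file names without embedded newlines.

```shell
#!/bin/sh
# cksum_dups FILE...
# Hypothetical helper: print the cksum lines of files whose CRC and
# byte count both match those of at least one other file.
cksum_dups() {
    # Sorting puts lines with the same "CRC size" prefix next to each other.
    cksum "$@" | LC_ALL=C sort | awk '
        { key = $1 " " $2 }
        key == prev {
            # Print the first member of the group once, then each later member.
            if (!shown) print prevline
            print
            shown = 1
            next
        }
        { prev = key; prevline = $0; shown = 0 }
    '
}
```

Running cksum_dups *.html against the three files above would list only home.html and index.html. A matching CRC is strong evidence rather than proof, so a final cmp before deleting anything is still a sensible precaution.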

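Using the find command

While the find command has no option for finding duplicate files, it can be used to search for files by name or type and run the cksum command on each one. For example: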

$ find . -name "*.html" -exec cksum {} \;
4073570409 227985 ./home.html
2819078353 228029 ./backup.html
4073570409 227985 ./index.html
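The find output above still leaves the matching to the reader. A sketch of one way to finish the job is to hash every file under a directory and keep only the repeated hashes. This swaps cksum for sha256sum (a stronger hash) and relies on GNU uniq's -w and --all-repeated options, so it assumes GNU coreutils; find_dups is a hypothetical helper name.

```shell
#!/bin/sh
# find_dups [DIR]
# Hypothetical helper: list groups of files under DIR (default: current
# directory) that share a SHA-256 hash, one blank line between groups.
find_dups() {
    # "-exec ... {} +" passes many file names to each sha256sum run.
    find "${1:-.}" -type f -exec sha256sum {} + |
        # Sorting places identical hashes on adjacent lines.
        LC_ALL=C sort |
        # Compare only the first 64 characters (the hex digest) and
        # print every line that belongs to a repeated group.
        uniq -w 64 --all-repeated=separate
}
```

For example, find_dups ~ would walk the whole home directory. Unlike the cksum example, this descends into subdirectories and is not limited to one file name pattern.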

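Using the fslint command

The fslint command can be used specifically to find duplicate files. Note that we give it a starting location. It can take quite some time to complete if it has to run through a large number of files. Note how it lists the duplicate files and also looks for other problems, such as empty directories and bad IDs.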

$ fslint .
-----------------------------------file name lint
-------------------------------Invalid utf8 names
-----------------------------------file case lint
----------------------------------DUPlicate files    <==
home.html
index.html
-----------------------------------Dangling links
--------------------redundant characters in links
------------------------------------suspect links
--------------------------------Empty Directories
./.gnupg
----------------------------------Temporary Files
----------------------duplicate/conflicting Names
------------------------------------------Bad ids
-------------------------Non Stripped executables

You may have to install fslint on your system. You may also need to add it to your command search path:

$ export PATH=$PATH:/usr/share/fslint/fslint

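Using the rdfind command

The rdfind command also looks for duplicate (same content) files. The name stands for "redundant data find," and the command can determine, based on file dates, which files are the originals. This is helpful if you choose to delete the duplicates, because it will remove the newer files.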

$ rdfind ~
Now scanning "/home/shark", found 12 files.
Now have 12 files in total.
Removed 1 files due to nonunique device and inode.
Total size is 699498 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
It seems like you have 2 files that are not unique
Totally, 223 KiB can be reduced.
Now making results file results.txt

You can also run this command in "dryrun" mode, in which it only reports the changes that it would otherwise make.

$ rdfind -dryrun true ~
(DRYRUN MODE) Now scanning "/home/shark", found 12 files.
(DRYRUN MODE) Now have 12 files in total.
(DRYRUN MODE) Removed 1 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 699352 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
(DRYRUN MODE) It seems like you have 2 files that are not unique
(DRYRUN MODE) Totally, 223 KiB can be reduced.
(DRYRUN MODE) Now making results file results.txt

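The rdfind command also provides options for things like ignoring empty files (-ignoreempty) and following symbolic links (-followsymlinks). Check the man page for explanations.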

-ignoreempty       ignore empty files
-minsize           ignore files smaller than specified size
-followsymlinks    follow symbolic links
-removeidentinode  remove files referring to identical inode
-checksum          identify checksum type to be used
-deterministic     determines how to sort files
-makesymlinks      turn duplicate files into symbolic links
-makehardlinks     replace duplicate files with hard links
-makeresultsfile   create a results file in the current directory
-outputname        provide name for results file
-deleteduplicates  delete/unlink duplicate files
-sleep             set sleep time between reading files (milliseconds)
-n, -dryrun        display what would have been done, but don't do it

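Note that the rdfind command offers the -deleteduplicates true setting for deleting duplicates. Hopefully the small grammar problem in the command's output won't irritate you. ;-)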

$ rdfind -deleteduplicates true .
...
Deleted 1 files.    <==

You will likely have to install rdfind on your system. It's probably a good idea to experiment with it to get comfortable with how it works.
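Replacing a duplicate with a hard link, which is what the -makehardlinks option listed above offers, can also be done by hand for a single pair of files. The following is a minimal sketch and not a description of how rdfind works internally: link_if_same is a hypothetical helper name, and it assumes both files live on the same file system, since hard links cannot cross file systems.

```shell
#!/bin/sh
# link_if_same ORIGINAL DUPLICATE
# Hypothetical helper: replace DUPLICATE with a hard link to ORIGINAL,
# but only when the two files are byte-for-byte identical.
link_if_same() {
    orig=$1
    dup=$2
    # cmp -s is silent; a non-zero status means the files differ
    # or one of them could not be read, so nothing is changed.
    cmp -s "$orig" "$dup" || return 1
    # ln -f removes DUPLICATE and recreates that name as a link to ORIGINAL.
    ln -f "$orig" "$dup"
}
```

After a successful call such as link_if_same home.html index.html, both names refer to the same inode, so the content is stored only once. Keep in mind that an edit made through either name then changes what both names show.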

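Using the fdupes command

The fdupes command also makes it easy to identify duplicate files. It provides a large number of useful options as well, such as -r for recursion. In this example, it groups duplicate files together like this: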

$ fdupes ~
/home/shs/UPGRADE
/home/shs/mytwin

/home/shs/lp.txt
/home/shs/lp.man

/home/shs/penguin.png
/home/shs/penguin0.png
/home/shs/hideme.png

Here is an example using recursion. Note that many of the duplicate files are important (users' .bashrc and .profile files) and should not be deleted.

# fdupes -r /home
/home/shark/home.html
/home/shark/index.html

/home/dory/.bashrc
/home/eel/.bashrc

/home/nemo/.profile
/home/dory/.profile
/home/shark/.profile

/home/nemo/tryme
/home/shs/tryme

/home/shs/arrow.png
/home/shs/PNGs/arrow.png

/home/shs/11/files_11.zip
/home/shs/ERIC/file_11.zip

/home/shs/penguin0.jpg
/home/shs/PNGs/penguin.jpg
/home/shs/PNGs/penguin0.jpg

/home/shs/Sandra_rotated.png
/home/shs/PNGs/Sandra_rotated.png

Many of the fdupes command's options are listed below. Use the fdupes -h command or read the man page for more details.

 -r --recurse      recurse
 -R --recurse:     recurse through specified directories
 -s --symlinks     follow symlinked directories
 -H --hardlinks    treat hard links as duplicates
 -n --noempty      ignore empty files
 -f --omitfirst    omit the first file in each set of matches
 -A --nohidden     ignore hidden files
 -1 --sameline     list matches on a single line
 -S --size         show size of duplicate files
 -m --summarize    summarize duplicate files information
 -q --quiet        hide progress indicator
 -d --delete       prompt user for files to preserve
 -N --noprompt     when used with --delete, preserve the first file in set
 -I --immediate    delete duplicates as they are encountered
 -p --permissions  don't consider files with different owner/group or
                   permission bits as duplicates
 -o --order=WORD   order files according to specification
 -i --reverse      reverse order while sorting
 -v --version      display fdupes version
 -h --help         displays help

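The fdupes command is another one that you will probably have to install and work with for a while before you are familiar with its many options.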

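Wrap-up

Linux systems provide a good selection of tools for locating and (potentially) removing duplicate files, along with options for where to run your search and what to do with the duplicates you find.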

via:https://www.networkworld.com/article/3390204/how-to-identify-same-content-files-on-linux.html#tk.rss_all

Author: Sandra Henry-Stocker; topic selection: lujun9972; translator: tomjlw; proofreader: wxy
