pl-copyfind
A plagiarism comparing function
This project was inspired by Dr Lou Bloomfield's CopyFind/WCopyFind Windows programs (http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/). The algorithm is equivalent, but there are marked differences in what the package does around it.
Key Differences:
pl-copyfind does not:
- download and extract the 'text' from various file formats. Depending upon the platform (either nodejs or a browser), there are a few solutions to this:
  - Mozilla's 'readability' functions, which generally do excellent work at extracting only the main article html from web pages
  - the npm package textract. This is a 'one stop shop' for reading in a plethora of file formats. Note that this cannot be used in a browser solution, as it requires external binaries to be installed (although a server-side solution can be used for this).
  - the demo package illustrates a poor man's method of converting html to text, using purely regexes.
- generate output files, although an optional html output is available.
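As a rough illustration of the regex-only approach mentioned above, the sketch below strips markup from an html string. The function name `htmlToText` is hypothetical, this is not the demo's actual code, and the entity list is deliberately tiny. Regexes cannot parse arbitrary html reliably, so treat it as a starting point for simple pages only.

```javascript
// Hypothetical helper: not part of pl-copyfind and not the demo's actual code.
function htmlToText(html) {
    return html
        // drop script and style blocks together with their contents
        .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
        // drop html comments
        .replace(/<!--[\s\S]*?-->/g, " ")
        // replace every remaining tag with a space so adjacent words stay separate
        .replace(/<[^>]+>/g, " ")
        // decode a handful of common entities (far from a complete list);
        // &amp; goes last so that "&amp;lt;" ends up as "&lt;", not "<"
        .replace(/&nbsp;/g, " ")
        .replace(/&lt;/g, "<")
        .replace(/&gt;/g, ">")
        .replace(/&quot;/g, '"')
        .replace(/&#39;/g, "'")
        .replace(/&amp;/g, "&")
        // collapse runs of whitespace and trim the ends
        .replace(/\s+/g, " ")
        .trim();
}

var text = htmlToText("<p>Lorem <b>ipsum</b> &amp; dolorem</p><script>var x = 1;</script>");
console.log(text); // Lorem ipsum & dolorem
```

The resulting string can then be passed to the comparison as either the source or the check text.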
pl-copyfind does:
- have switches equivalent to those of the original program. However, you can ignore all of them and run your own sanitisers.
- allow the 'hashes' to be stored in a cache. This is up to you to implement (although it is easy using one of the many npm file caching packages).
- allow multiple inputs (comparators and comparatees).
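If you prefer to run your own sanitiser instead of the built-in switches, something along these lines could be applied to each text before it is compared. The function `sanitise` is a hypothetical name, and the rules it applies (lower-casing, ASCII-only filtering) are assumptions chosen for illustration, not what the library does internally.

```javascript
// Hypothetical sanitiser: an illustration only, not code from pl-copyfind.
function sanitise(text) {
    return text
        .toLowerCase()
        // keep ASCII letters, digits, apostrophes and whitespace;
        // everything else (including accented letters) becomes a space
        .replace(/[^a-z0-9'\s]/g, " ")
        // collapse runs of whitespace and trim the ends
        .replace(/\s+/g, " ")
        .trim();
}

var cleaned = sanitise("Lorem, ipsum -- DOLOREM est!");
console.log(cleaned); // lorem ipsum dolorem est
```

Keep in mind that any positions reported for a match would presumably index into the sanitised string you passed in, not the original text, so extract matched passages from the sanitised copy.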
Default Options:
PhraseLength: 6, // Shortest Phrase to Match
WordThreshold: 100, // Fewest Matches to Report
SkipLength: 20, // Words this long are skipped. Needs bSkipLongWords
MismatchTolerance: 2, // Most Imperfections to Allow
MismatchPercentage: 80, // Minimum % of Matching Words
bIgnoreCase: false, // Ignore Letter Case
bIgnoreNumbers: false, // Ignore Numbers
bIgnoreOuterPunctuation: false, // Ignore Outer Punctuation
bIgnorePunctuation: false, // Ignore Punctuation
bSkipLongWords: false, // Skip Long Words
bSkipNonwords: false, // Skip Non-Words
bBuildReport: true, // generate html output
bBriefReport: true, // show a html report of matches with lead in and out text, for context (otherwise shows full source text). Needs bBuildReport
bTerseReport: false // show ONLY the matching text. Needs bBuildReport
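Several of these options only take effect when another one is switched on. The sketch below merges user options over the defaults listed above and flags such combinations. `resolveOptions` is a hypothetical helper, not part of pl-copyfind, and the warning texts are made up; it is only meant to make the "Needs ..." notes concrete.

```javascript
// Defaults copied from the list above.
var DEFAULTS = {
    PhraseLength: 6, WordThreshold: 100, SkipLength: 20,
    MismatchTolerance: 2, MismatchPercentage: 80,
    bIgnoreCase: false, bIgnoreNumbers: false,
    bIgnoreOuterPunctuation: false, bIgnorePunctuation: false,
    bSkipLongWords: false, bSkipNonwords: false,
    bBuildReport: true, bBriefReport: true, bTerseReport: false
};

// Hypothetical helper, not part of pl-copyfind: returns the effective options
// plus warnings about settings that depend on a switch that is turned off.
function resolveOptions(userOptions) {
    var user = userOptions || {};
    var opts = Object.assign({}, DEFAULTS, user);
    var warnings = [];
    if ("SkipLength" in user && !opts.bSkipLongWords)
        warnings.push("SkipLength has no effect unless bSkipLongWords is true");
    if (!opts.bBuildReport && (user.bBriefReport || user.bTerseReport))
        warnings.push("bBriefReport and bTerseReport have no effect unless bBuildReport is true");
    return { options: opts, warnings: warnings };
}

var resolved = resolveOptions({ PhraseLength: 3, SkipLength: 10 });
console.log(resolved.options.PhraseLength); // 3
console.log(resolved.warnings.length);      // 1
```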
Usage:
See the demos folder for a complete working example that does not require a web server to execute (just open index.html from your local file system to try it out).
Example 1. Single input comparison
var copyfind = require('pl-copyfind');
...
var options = { PhraseLength: 3, WordThreshold: 3, bIgnoreCase:true};
var src_text = "original text is here. lorem ipsum dolorem est";
var check_text = "I plagiarised lorem ipsum DOLOREM est and I reckon I can get away with it";
copyfind(src_text, check_text, options, function(err, data) {
    if (err)
        throw "Failed to compare: " + err.toString();
    if (!data.matches.length)
        return false; // no comparison found
    console.log("Found " + data.matches.length + " matches"); // expect 1
    for (var i=0; i<data.matches.length; i++) {
        var match = data.matches[i];
        var orig_text = src_text.substr(match.textL.pos, match.textL.length);
        var copied_text = check_text.substr(match.textR.pos, match.textR.length);
        console.log("Match found: \n" + orig_text + "\nvs. \n" + copied_text + "\nat position : " + match.textR.pos);
    }
});
Example 2. Multiple input comparisons
var copyfind = require('pl-copyfind');
...
var options = { PhraseLength: 3, WordThreshold: 3 };
var src_texts = ["original text is here. lorem ipsum dolorem est","This is another original text that is also dolorem est"];
var check_texts = ["I plagiarised lorem ipsum dolorem est and I reckon I can get away with it","I didn't do lorem est this time"];
copyfind(src_texts, check_texts, options, function(err, data) {
    if (err)
        throw "Failed to compare: " + err.toString();
    if (!data.matches.length)
        return false; // no comparison found on any text
    for (var l=0; l<src_texts.length; l++) {
        for (var r=0; r<check_texts.length; r++) {
            for (var i=0; i<data.matches[l][r].length; i++) {
                var match = data.matches[l][r][i];
                var orig_text = src_texts[l].substr(match.textL.pos, match.textL.length);
                var copied_text = check_texts[r].substr(match.textR.pos, match.textR.length);
                console.log("Match found: #["+l+"]\n" + orig_text + "\nvs. #["+r+"]\n" + copied_text + "\nat position : " + match.textR.pos);
            }
        }
    }
});
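When many texts are compared, the nested matches[l][r] result can be easier to work with as a flat list. The sketch below assumes the result shape used in Example 2 (each match carrying textL and textR with pos and length). `flattenMatches` is a hypothetical helper, and the sample data is hand-made for illustration, not real library output.

```javascript
// Hypothetical helper, not part of pl-copyfind. Assumes matches[l][r] is an array of
// { textL: {pos, length}, textR: {pos, length} } as used in Example 2.
function flattenMatches(matches, srcTexts, checkTexts) {
    var rows = [];
    for (var l = 0; l < srcTexts.length; l++) {
        for (var r = 0; r < checkTexts.length; r++) {
            var pair = (matches[l] && matches[l][r]) || []; // tolerate missing cells
            for (var i = 0; i < pair.length; i++) {
                var m = pair[i];
                rows.push({
                    srcIndex: l,
                    checkIndex: r,
                    srcText: srcTexts[l].slice(m.textL.pos, m.textL.pos + m.textL.length),
                    checkText: checkTexts[r].slice(m.textR.pos, m.textR.pos + m.textR.length)
                });
            }
        }
    }
    return rows;
}

// Hand-made sample data standing in for a real result.
var sampleSrc = ["lorem ipsum dolorem est"];
var sampleCheck = ["I copied ipsum dolorem est"];
var sampleMatches = [[[{ textL: { pos: 6, length: 17 }, textR: { pos: 9, length: 17 } }]]];

var rows = flattenMatches(sampleMatches, sampleSrc, sampleCheck);
console.log(rows[0].srcText);   // ipsum dolorem est
console.log(rows[0].checkText); // ipsum dolorem est
```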
Example 3. Render html reports
var copyfind = require('pl-copyfind');
...
var options = { bBuildReport: true };
var src_text = "original text is here. lorem ipsum dolorem est";
var check_text = "I plagiarised lorem ipsum dolorem est and I reckon I can get away with it";
copyfind(src_text, check_text, options, function(err, data) {
    if (err)
        throw "Failed to compare: " + err.toString();
    alert(data.html);
});
Example 4. Cache results for faster re-comparisons
var copyfind = require('pl-copyfind');
...
var options = { };
var src_text = "original text is here. lorem ipsum dolorem est";
var check_text1 = "I plagiarised lorem ipsum dolorem est and I reckon I can get away with it";
var check_text2 = "Another plagiarised lorem ipsum dolorem est and I reckon I can get away with it";
copyfind(src_text, check_text1, options, function(err, data) {
    if (err)
        throw "Failed to compare: " + err.toString();
    alert("execution took " + data.executionTime + " ms");
    options.hashesL = data.hashesL; // save the hash data. You *could* store this in a file cache too
    // re-uses hashesL for better performance. The second comparison is started here,
    // inside the first callback, so that hashesL is certain to be set before it runs
    copyfind(src_text, check_text2, options, function(err, data) {
        if (err)
            throw "Failed to compare: " + err.toString();
        alert("execution took " + data.executionTime + " ms");
    });
});
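To keep the hash data between runs, one option under nodejs is a small JSON file cache. The sketch below uses only the built-in fs, os and path modules. `saveHashes` and `loadHashes` are hypothetical helpers, the file name is arbitrary, and the code assumes the hash data survives a JSON round trip, which this README does not confirm.

```javascript
var fs = require("fs");
var os = require("os");
var path = require("path");

// Hypothetical helpers, not part of pl-copyfind. They assume the hash data is
// plain JSON-serialisable data.
function saveHashes(file, hashes) {
    fs.writeFileSync(file, JSON.stringify(hashes), "utf8");
}

function loadHashes(file) {
    if (!fs.existsSync(file))
        return undefined; // nothing cached yet
    return JSON.parse(fs.readFileSync(file, "utf8"));
}

var cacheFile = path.join(os.tmpdir(), "pl-copyfind-hashesL.json");
// Stand-in value; in real use this would be data.hashesL from a copyfind callback.
saveHashes(cacheFile, { example: [1, 2, 3] });
var restored = loadHashes(cacheFile);
console.log(restored.example.length); // 3
```

On a later run, a value returned by `loadHashes` could be assigned to options.hashesL before calling copyfind, as in Example 4. The cache would need to be discarded whenever the source text or the sanitising options change.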
Licensing:
This module and all its source is licensed under the GPL, which is the original licensing of the WCopyFind/CopyFind source. The license file can be found at [https://github.com/cmroanirgo/pl-copyfind/blob/master/LICENSE.md].
Please note that if you use this library as-is, your project need not be subject to what is commonly called 'GPL cancer'. Only if you embrace and extend the module must you also release your source code under a GPL license. However, it would be appreciated if the work done in this project was acknowledged in your source and information pages.