2025/08/18
Development of PDF Difference Detection Software
Project Overview
The client needed to confirm that there were no discrepancies between PDF reports generated by the new system after migration and those generated by the old system. The purpose of this project was to build dedicated software capable of detecting differences in PDF files to meet this need.
The software compares content, layout, and display position between PDF files. When it finds discrepancies, it outputs detailed visual reports in image and HTML formats so that users can identify the differences easily.
Technology Stack and Development Tools
・Programming Language: PHP 7.4
・Task Management Tool: Backlog
The Client’s Challenges
After a system migration, the client needed to verify that existing functionality still worked correctly, especially the PDF report generation feature.
The specific problems were:
- A large number of PDF files generated by both the old and new systems had to be compared.
- Display position, spacing, or fonts could be misaligned even when the content was identical.
- Manual comparison was time-consuming and prone to human error.
The Client’s Requirements
- The ability to simultaneously compare two or more PDF files from the old and new systems.
- The capability to detect the following discrepancies:
  - Differences in text content.
  - Differences in display position (to be detected even if the content is the same).
  - Differences in overall layout (spacing, margins, alignment, etc.).
- The ability to output comparison reports in the following formats:
  - Image format (highlighting the areas with discrepancies).
  - HTML format (displaying the PDF content and highlighting the differences).
- A simple, user-friendly interface that operates as a standalone application without requiring an internet connection.
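The distinction between a text difference and a position difference in the requirements above can be illustrated with a minimal sketch. It is written in Python for brevity (the project itself used PHP 7.4) and is not the delivered implementation. The `TextBox` type, the `compare_boxes` function, and the 0.5-point tolerance are all hypothetical, and the sketch assumes the text and coordinates have already been extracted from each PDF page by some PDF library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """A piece of text with its position on the page, in points (hypothetical type)."""
    text: str
    x: float
    y: float


def compare_boxes(old, new, tolerance=0.5):
    """Pair boxes by order and classify each differing pair.

    Returns a list of (index, kind) tuples. kind is "text" when the strings
    differ, and "position" when the strings match but either coordinate moved
    by more than `tolerance` points (an assumed threshold).
    """
    diffs = []
    for i, (a, b) in enumerate(zip(old, new)):
        if a.text != b.text:
            diffs.append((i, "text"))
        elif abs(a.x - b.x) > tolerance or abs(a.y - b.y) > tolerance:
            diffs.append((i, "position"))
    # Boxes present on only one side: "missing" from the new page, or "extra" in it.
    kind = "missing" if len(old) > len(new) else "extra"
    for i in range(min(len(old), len(new)), max(len(old), len(new))):
        diffs.append((i, kind))
    return diffs


old_page = [TextBox("Invoice", 72.0, 72.0), TextBox("Total: 100", 72.0, 120.0)]
new_page = [TextBox("Invoice", 72.0, 72.0), TextBox("Total: 100", 80.0, 120.0)]
print(compare_boxes(old_page, new_page))  # [(1, 'position')]
```

Here the second box has identical text but has shifted 8 points to the right, so it is reported as a position difference rather than a text difference.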
Our Proposal and Approach
We proposed the development of a console application (run from the command line) specialized for checking the consistency of PDF files.
Since the client is an IT company, a console-based format is easy to integrate into their testing processes and automation scripts, making it well-suited for their technical workflow.
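To show why a console tool fits automated testing, here is a rough sketch of a command-line entry point that pairs reports from two directories by file name and returns a non-zero exit code when anything differs. It is written in Python for illustration only (the project itself used PHP 7.4). The names `pair_reports` and `main`, the directory arguments, and the output lines are assumptions, and the default byte-for-byte comparison is only a stand-in for a real PDF content comparison.

```python
import argparse
from pathlib import Path


def pair_reports(old_dir, new_dir):
    """Match PDFs in the two directories by file name.

    Returns (pairs, unmatched): pairs is a list of (old_path, new_path) sorted
    by name; unmatched is a sorted list of names found in only one directory.
    """
    old = {p.name: p for p in Path(old_dir).glob("*.pdf")}
    new = {p.name: p for p in Path(new_dir).glob("*.pdf")}
    pairs = [(old[n], new[n]) for n in sorted(old.keys() & new.keys())]
    unmatched = sorted(old.keys() ^ new.keys())
    return pairs, unmatched


def main(argv=None, compare=None):
    """Return 0 when every pair matches, 1 otherwise (usable as an exit code)."""
    parser = argparse.ArgumentParser(description="Compare old and new PDF reports.")
    parser.add_argument("old_dir")
    parser.add_argument("new_dir")
    args = parser.parse_args(argv)

    # `compare` returns True when two files differ. The default raw-byte check
    # is a placeholder; a real tool would compare extracted content instead.
    if compare is None:
        compare = lambda a, b: a.read_bytes() != b.read_bytes()

    pairs, unmatched = pair_reports(args.old_dir, args.new_dir)
    differing = [old.name for old, new in pairs if compare(old, new)]
    for name in differing:
        print(f"DIFF {name}")
    for name in unmatched:
        print(f"UNMATCHED {name}")
    return 1 if differing or unmatched else 0
```

A script would end with `raise SystemExit(main())`, so that a test pipeline can fail the build whenever the exit code is non-zero.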
Our primary approach included the following:
- Automatically comparing text content between PDF files, matching them paragraph by paragraph and line by line.
- Detecting and marking discrepancies even when the content was identical but the display position differed.
- Generating visual reports:
  - Creating comparison images that highlight the differing sections.
  - Generating HTML reports that display the PDF content and highlight the discrepancies.
- Streamlining the verification process with a feature to compare multiple PDF files simultaneously.
This solution enabled the client’s testing and technical teams to quickly and accurately perform migration verification for large-scale systems, especially when dealing with a high volume of complex PDF reports.
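As an illustration of the line-by-line text comparison and HTML highlighting described in the approach, the following minimal sketch uses Python's standard `difflib` module (the project itself used PHP 7.4, so this is not the delivered code). The function name `html_diff_report` and the `<del>`/`<ins>` markup are assumptions, and the input is assumed to be text lines already extracted from each PDF.

```python
import difflib
import html


def html_diff_report(old_lines, new_lines):
    """Return an HTML fragment listing the lines, with changed lines marked."""
    rows = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            rows += [f"<p>{html.escape(line)}</p>" for line in old_lines[i1:i2]]
            continue
        # Lines that appear only in the old report (deleted or replaced).
        rows += [f"<p><del>{html.escape(line)}</del></p>" for line in old_lines[i1:i2]]
        # Lines that appear only in the new report (inserted or replacing).
        rows += [f"<p><ins>{html.escape(line)}</ins></p>" for line in new_lines[j1:j2]]
    return "\n".join(rows)


old = ["Monthly Report", "Total: 100", "Tax: 10"]
new = ["Monthly Report", "Total: 105", "Tax: 10"]
print(html_diff_report(old, new))
# <p>Monthly Report</p>
# <p><del>Total: 100</del></p>
# <p><ins>Total: 105</ins></p>
# <p>Tax: 10</p>
```

Escaping each line with `html.escape` keeps report text such as `<` or `&` from being interpreted as markup in the generated page.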