dsresumatch.pdf_cv_processing

Functions

`read_pdf`(file_path)	Extract text content from a PDF file and return it as a single consolidated string.
`clean_text`(raw_text)	Convert raw_text to lowercase, remove punctuation, and filter out common English stop words.
`count_words_in_pdf`(file_path)	Count the frequency of words in a PDF file.

Module Contents

dsresumatch.pdf_cv_processing.read_pdf(file_path)

Extract text content from a PDF file and return it as a single consolidated string.

Parameters: file_path (str) – Path to the PDF file.
Returns: PDF file contents as text.
Return type: str

Examples

>>> read_pdf("cv.pdf")
'Work Experience Software Developer at XYZ Corp. Education Bachelor of Science in Computer Science'
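
This page does not say which PDF library `read_pdf` uses. As a rough sketch only, the following assumes the third-party `pypdf` package and a hypothetical helper name, `read_pdf_sketch`; the actual implementation may differ.

```python
from pathlib import Path


def read_pdf_sketch(file_path: str) -> str:
    """Hypothetical stand-in for read_pdf: return all page text as one string."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Assumption: pypdf is installed. It is imported lazily so that this
    # sketch can be loaded even where the package is missing.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    # extract_text() may return an empty result for image-only pages,
    # so fall back to an empty string before stripping.
    page_texts = [(page.extract_text() or "").strip() for page in reader.pages]
    # Join the non-empty pages with single spaces into one consolidated string.
    return " ".join(text for text in page_texts if text)
```

Checking for the file before opening it gives a clear error for a mistyped path instead of a library-specific one.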

dsresumatch.pdf_cv_processing.clean_text(raw_text)

Convert raw_text to lowercase, remove punctuation, and filter out common English stop words to retain only meaningful words in the string.

Parameters: raw_text (str) – Text to clean.
Returns: Cleaned text.
Return type: str

Examples

>>> clean_text("Work Experience: Software Developer at XYZ Corp!")
'work experience software developer xyz corp'
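
The three cleaning steps can be approximated with the standard library alone. The sketch below is an illustration, not the module's code: `clean_text_sketch` and the tiny `STOP_WORDS` set are hypothetical, and the real function presumably relies on a fuller stop-word list.

```python
import string

# Hypothetical, deliberately small stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the", "to"}


def clean_text_sketch(raw_text: str) -> str:
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    lowered = raw_text.lower()
    # str.translate with a deletion table removes every punctuation character.
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    kept = [word for word in no_punctuation.split() if word not in STOP_WORDS]
    return " ".join(kept)


print(clean_text_sketch("Work Experience: Software Developer at XYZ Corp!"))
# work experience software developer xyz corp
```

Punctuation is removed before splitting so that tokens such as "experience:" and "corp!" reduce to plain words before the stop-word check.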

dsresumatch.pdf_cv_processing.count_words_in_pdf(file_path)

Count the frequency of words in a PDF file.

This function converts all words to lowercase, removes punctuation, and excludes common English stop words to ensure meaningful word counts.

Parameters: file_path (str) – Path to the PDF file.
Returns: Dictionary-like object with the frequency of each remaining word where keys are words and values are counts.
Return type: collections.Counter

Examples

>>> count_words_in_pdf("cv.pdf")
Counter({'science': 2, 'work': 1, 'experience': 1, 'software': 1, 'developer': 1, 'xyz': 1,
'corp': 1, 'education': 1, 'bachelor': 1, 'computer': 1})
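
The counting step can be sketched on plain text, leaving PDF extraction aside. This is an assumption-laden illustration: `count_words_sketch` and its small `STOP_WORDS` set are hypothetical names, not part of `dsresumatch`.

```python
import string
from collections import Counter

# Hypothetical, deliberately small stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the", "to"}


def count_words_sketch(text: str) -> Counter:
    """Count the words left after lowercasing and removing punctuation and stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(word for word in cleaned.split() if word not in STOP_WORDS)


counts = count_words_sketch("Bachelor of Science in Computer Science")
print(counts.most_common(1))
# [('science', 2)]
```

Because the result is a `Counter`, callers can use `most_common()` to rank terms, and a word that is absent yields a count of 0 instead of raising `KeyError`.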