dsresumatch.pdf_cv_processing

Functions

read_pdf(file_path)

Extract text content from a PDF file and return it as a single consolidated string.

clean_text(raw_text)

Convert raw_text to lowercase, remove punctuation, and filter out common English stop words to retain only meaningful words in the string.

count_words_in_pdf(file_path)

Count the frequency of words in a PDF file.

Module Contents

dsresumatch.pdf_cv_processing.read_pdf(file_path)

Extract text content from a PDF file and return it as a single consolidated string.

Parameters:

file_path (str) – Path to the PDF file.

Returns:

PDF file contents as text.

Return type:

str

Examples

>>> read_pdf("cv.pdf")
'Work Experience Software Developer at XYZ Corp. Education Bachelor of Science in Computer Science'
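
The extraction step can be sketched roughly as follows. This is not the module's actual implementation: this page does not say which PDF library dsresumatch uses, so pypdf is an assumption here, and the names read_pdf_sketch and join_page_texts are hypothetical. The newline separator between pages is also an assumption.

```python
def join_page_texts(page_texts):
    """Join per-page text into one string, skipping empty or missing pages.

    Hypothetical helper; the separator (a newline) is an assumption.
    """
    return "\n".join(t.strip() for t in page_texts if t and t.strip())


def read_pdf_sketch(file_path):
    """Return the text of every page in the PDF as one consolidated string."""
    # Assumption: pypdf is installed. It is imported lazily so that
    # join_page_texts can be used and tested without it.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    # extract_text() is called once per page; the results are then merged.
    return join_page_texts(page.extract_text() for page in reader.pages)
```

Keeping the joining logic separate from the file access makes the consolidation rule easy to check without a real PDF on disk.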

dsresumatch.pdf_cv_processing.clean_text(raw_text)

Convert raw_text to lowercase, remove punctuation, and filter out common English stop words to retain only meaningful words in the string.

Parameters:

raw_text (str) – Text to clean.

Returns:

Cleaned text.

Return type:

str

Examples

>>> clean_text("Work Experience: Software Developer at XYZ Corp!")
'work experience software developer xyz corp'
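
The three cleaning steps described above (lowercasing, punctuation removal, stop-word filtering) could look roughly like this. It is a minimal sketch, not the module's source: clean_text_sketch is a hypothetical name, and STOP_WORDS is a deliberately tiny illustrative list, whereas the real function presumably relies on a much fuller English stop-word list.

```python
import string

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "at", "in", "of", "the", "to"})


def clean_text_sketch(raw_text: str) -> str:
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    lowered = raw_text.lower()
    # str.translate with a deletion table removes every punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and keep only words that are not stop words.
    kept = [word for word in no_punct.split() if word not in STOP_WORDS]
    return " ".join(kept)


print(clean_text_sketch("Work Experience: Software Developer at XYZ Corp!"))
# prints: work experience software developer xyz corp
```

Note that string.punctuation covers ASCII punctuation only; typographic quotes or dashes extracted from a PDF would need separate handling.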

dsresumatch.pdf_cv_processing.count_words_in_pdf(file_path)

Count the frequency of words in a PDF file.

This function converts all words to lowercase, removes punctuation, and excludes common English stop words so that the counts reflect only meaningful words.

Parameters:

file_path (str) – Path to the PDF file.

Returns:

Dictionary-like object with the frequency of each remaining word where keys are words and values are counts.

Return type:

collections.Counter

Examples

>>> count_words_in_pdf("cv.pdf")
Counter({'science': 2, 'work': 1, 'experience': 1, 'software': 1, 'developer': 1, 'xyz': 1,
'corp': 1, 'education': 1, 'bachelor': 1, 'computer': 1})
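
The counting step can be illustrated on already-extracted text. This is a sketch under stated assumptions rather than the module's code: count_words_sketch is a hypothetical name, it takes a string instead of a file path so that it runs without a PDF, and STOP_WORDS is a small illustrative list.

```python
import string
from collections import Counter

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "at", "in", "of", "the", "to"})


def count_words_sketch(text: str) -> Counter:
    """Count the remaining words after lowercasing and removing
    punctuation and stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(word for word in cleaned.split() if word not in STOP_WORDS)


counts = count_words_sketch(
    "Work Experience Software Developer at XYZ Corp. "
    "Education Bachelor of Science in Computer Science"
)
print(counts.most_common(1))
# prints: [('science', 2)]
```

Because the result is a collections.Counter, looking up a word that never occurred returns 0 instead of raising KeyError, and most_common(n) gives the n most frequent words.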