dsresumatch.pdf_cv_processing
Functions
read_pdf(file_path) | Extract text content from a PDF file and return it as a single consolidated string.
clean_text(raw_text) | Convert raw_text to lowercase, remove punctuation, and filter out common English stop words.
count_words_in_pdf(file_path) | Count the frequency of words in a PDF file.
Module Contents
- dsresumatch.pdf_cv_processing.read_pdf(file_path)
Extract text content from a PDF file and return it as a single consolidated string.
- Parameters:
file_path (str) – Path to the PDF file.
- Returns:
PDF file contents as text.
- Return type:
str
Examples
>>> read_pdf("cv.pdf")
'Work Experience\nSoftware Developer at XYZ Corp. Education Bachelor of Science in Computer Science'
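To illustrate what "a single consolidated string" could mean in practice, here is a minimal sketch. It assumes the third-party pypdf package as the PDF backend; the documentation above does not say which library read_pdf actually uses. The names join_pages and extract_pdf_text are hypothetical and not part of dsresumatch.

```python
def join_pages(page_texts):
    """Join per-page text into one string, skipping empty or missing pages."""
    return "\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(file_path):
    """Hypothetical stand-in for read_pdf; not the dsresumatch implementation."""
    # Assumption: pypdf is the PDF backend. It is imported lazily so that
    # this sketch can be loaded even when pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    # extract_text() returns the text of one page as a str.
    return join_pages(page.extract_text() for page in reader.pages)
```

Joining pages with a newline is one possible choice; the real function may separate pages differently.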
- dsresumatch.pdf_cv_processing.clean_text(raw_text)
Convert raw_text to lowercase, remove punctuation, and filter out common English stop words to retain only meaningful words in the string.
- Parameters:
raw_text (str) – Text to clean.
- Returns:
Cleaned text.
- Return type:
str
Examples
>>> clean_text("Work Experience: Software Developer at XYZ Corp!")
'work experience software developer xyz corp'
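The three steps described for clean_text (lowercase, strip punctuation, drop stop words) can be sketched with the standard library alone. This is an illustration, not the package's code: clean_text_sketch is a hypothetical name, and the small STOP_WORDS set is an assumption standing in for whatever stop word list dsresumatch actually uses.

```python
import string

# Assumption: a tiny illustrative stop word list. The real list is
# presumably much larger and may come from an NLP library.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the", "to"}


def clean_text_sketch(raw_text):
    """Lowercase the text, remove ASCII punctuation, and drop stop words."""
    lowered = raw_text.lower()
    # str.translate with a deletion table removes every punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in no_punct.split() if w not in STOP_WORDS]
    return " ".join(words)


print(clean_text_sketch("Work Experience: Software Developer at XYZ Corp!"))
# prints: work experience software developer xyz corp
```

Note that string.punctuation covers ASCII punctuation only; typographic quotes or dashes in a CV would need extra handling.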
- dsresumatch.pdf_cv_processing.count_words_in_pdf(file_path)
Count the frequency of words in a PDF file.
This function converts all words to lowercase, removes punctuation, and excludes common English stop words so that only meaningful words are counted.
- Parameters:
file_path (str) – Path to the PDF file.
- Returns:
Dictionary-like object with the frequency of each remaining word where keys are words and values are counts.
- Return type:
collections.Counter
Examples
>>> count_words_in_pdf("cv.pdf")
Counter({'science': 2, 'work': 1, 'experience': 1, 'software': 1, 'developer': 1, 'xyz': 1, 'corp': 1, 'education': 1, 'bachelor': 1, 'computer': 1})
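As a rough illustration of the counting step, the sketch below works on text that has already been extracted, so it runs without a PDF file. The name count_words_sketch and the small STOP_WORDS set are assumptions made for this example, not part of dsresumatch.

```python
import string
from collections import Counter

# Assumption: a tiny illustrative stop word list, not the package's own.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the", "to"}


def count_words_sketch(text):
    """Count the remaining words after lowercasing, removing ASCII
    punctuation, and dropping stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(w for w in cleaned.split() if w not in STOP_WORDS)


counts = count_words_sketch(
    "Work Experience Software Developer at XYZ Corp. "
    "Education Bachelor of Science in Computer Science"
)
print(counts.most_common(1))  # prints: [('science', 2)]
```

Because the result is a Counter, callers can use most_common(n) to rank terms, and looking up a word that was filtered out returns 0 rather than raising KeyError.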