sciencescraper.pmc.pmc_clean
Functions to clean the data extracted from PMC.
Module Contents
Functions
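clean_full_text(pmc_article, chunk_size) | Returns the full text of the article, excluding figures, tables, and supplementary information sections, with reference numbers removed.
split_into_chunks(text, chunk_size) | Splits a given text into chunks of approximately 'chunk_size' words.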
- sciencescraper.pmc.pmc_clean.clean_full_text(pmc_article, chunk_size)
Returns the full text of the article, excluding figures, tables, and supplementary information sections, with reference numbers removed.
- Parameters:
pmc_article (BeautifulSoup) – The article as a BeautifulSoup object.
chunk_size (int) – The size of the chunks to split the full text into.
- Returns:
full_text – The full text of the article, excluding figures, tables, supplementary information sections, and reference numbers.
- Return type:
str
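The kind of cleaning described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the function name `sketch_clean_full_text`, the JATS tag names (`fig`, `table-wrap`, `supplementary-material`), and the assumption that reference numbers appear as `<xref ref-type="bibr">` elements are all hypothetical choices made for the example. The sketch also omits `chunk_size`, since the documentation does not say how it affects the returned string.

```python
import re

from bs4 import BeautifulSoup

# Assumed JATS element names; the real module may target different elements.
UNWANTED_TAGS = ["fig", "table-wrap", "supplementary-material"]


def sketch_clean_full_text(pmc_article):
    """Illustrative only: drop figures, tables, supplements, and citation markers."""
    # Detach figure, table, and supplementary elements from the tree.
    for tag in pmc_article.find_all(UNWANTED_TAGS):
        tag.extract()
    # Assumption: in-text reference numbers are <xref ref-type="bibr"> elements.
    for xref in pmc_article.find_all("xref", attrs={"ref-type": "bibr"}):
        xref.extract()
    text = pmc_article.get_text(separator=" ")
    # Collapse runs of whitespace, then remove spaces left before punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,;:])", r"\1", text)


# Hypothetical input, parsed with the standard-library HTML parser.
xml = (
    '<article><p>Cells divide rapidly <xref ref-type="bibr">1</xref>.</p>'
    "<fig>Figure 1 caption</fig><table-wrap>Table 1</table-wrap>"
    "<p>Done.</p></article>"
)
cleaned = sketch_clean_full_text(BeautifulSoup(xml, "html.parser"))
print(cleaned)
```

Note that the sketch modifies the BeautifulSoup object in place; whether the real function does the same is not stated in this reference.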
- sciencescraper.pmc.pmc_clean.split_into_chunks(text, chunk_size)
Splits a given text into chunks of approximately ‘chunk_size’ words.
- Parameters:
text (str) – The text to split into chunks.
chunk_size (int) – The size of the chunks to split the text into.
- Returns:
List of the text split into chunks.
- Return type:
list of str
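One simple way to obtain chunks of roughly `chunk_size` words is sketched below. This is an assumption about the behavior, not the module's code: the name `sketch_split_into_chunks` is hypothetical, and the real function may split differently (for example, at sentence boundaries, which would explain the word "approximately").

```python
def sketch_split_into_chunks(text, chunk_size):
    """Illustrative only: group whitespace-separated words into fixed-size chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    words = text.split()
    # Each chunk holds up to chunk_size words; the last chunk may be shorter.
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


chunks = sketch_split_into_chunks("one two three four five", 2)
print(chunks)
```

In this sketch, splitting on whitespace means the original spacing and line breaks are not preserved inside the chunks.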