sciencescraper.sciencedirect.scidir_clean
Functions to clean the text extracted from ScienceDirect articles.
Module Contents
Functions
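| clean_fulltext(xml_text, chunk_size) | Clean the raw XML text of a ScienceDirect article to remove unnecessary information, leaving only the full text of the article. |
| split_into_chunks(text, chunk_size) | Split a given text into chunks of approximately 'chunk_size' words. |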
- sciencescraper.sciencedirect.scidir_clean.clean_fulltext(xml_text, chunk_size)
Clean the raw XML text of a ScienceDirect article to remove unnecessary information, leaving only the full text of the article.
- Parameters:
xml_text (str) – The raw XML text of an article.
chunk_size (int) – The size of the chunks to split the full text into.
- Returns:
The full text of the article, split into chunks.
- Return type:
list of str
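The cleaning step can be illustrated with a small stand-in, given here only as a sketch. It is not the library's implementation: the function name `clean_fulltext_sketch`, the `<body>` element, and the sample XML are all hypothetical, and real ScienceDirect XML uses a different, namespaced schema. The sketch parses the XML with the standard library, keeps only the text under the assumed body element, and splits it into fixed-size word chunks.

```python
import xml.etree.ElementTree as ET


def clean_fulltext_sketch(xml_text, chunk_size):
    """Illustrative stand-in, NOT the sciencescraper implementation.

    Assumes (hypothetically) that the article text lives under a <body>
    element; everything outside it is treated as unnecessary information.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    root = ET.fromstring(xml_text)
    body = root.find("body")  # hypothetical element name
    if body is None:
        return []

    # itertext() yields the text of every nested element in document order;
    # split() with no arguments also collapses runs of whitespace.
    words = " ".join(body.itertext()).split()

    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Hypothetical sample document, not a real ScienceDirect response.
sample_xml = (
    "<article>"
    "<meta><title>Sample title</title></meta>"
    "<body><para>First sentence here.</para><para>Second one.</para></body>"
    "</article>"
)

chunks = clean_fulltext_sketch(sample_xml, 3)
print(chunks)  # ['First sentence here.', 'Second one.']
```

The title is dropped because it sits outside the assumed body element, which mirrors the idea of leaving only the full text.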
- sciencescraper.sciencedirect.scidir_clean.split_into_chunks(text, chunk_size)
Splits a given text into chunks of approximately ‘chunk_size’ words.
- Parameters:
text (str) – The text to split into chunks.
chunk_size (int) – The approximate number of words in each chunk.
- Returns:
List of the text split into chunks.
- Return type:
list of str
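A minimal sketch of word-based chunking follows, to show the shape of the input and output. It is an assumption about the behavior, not the library's code: the name `split_into_chunks_sketch` is hypothetical, and this version cuts at exact word counts, whereas the documented function promises only "approximately" `chunk_size` words and may choose its boundaries differently.

```python
def split_into_chunks_sketch(text, chunk_size):
    """Illustrative stand-in, NOT the sciencescraper implementation.

    Splits on whitespace and groups the words into consecutive chunks of
    at most chunk_size words; the final chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


example_chunks = split_into_chunks_sketch("one two three four five", 2)
print(example_chunks)  # ['one two', 'three four', 'five']
```

An empty input string yields an empty list, since there are no words to group.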