sciencescraper.pmc
This subpackage contains modules for scraping articles from PubMed Central.
Submodules
pmc_extract
: Functions to extract information from the raw XML text of a PubMed Central article.
pmc_scrape
: Functions for retrieving the clean text of PubMed Central articles.
pmc_search
: Functions for searching for articles on PubMed Central.
The main functions of this subpackage are get_article_info in pmc_scrape and search_pmc in pmc_search.
get_article_info retrieves the full text of a PubMed Central article using the PubMed Central API and returns a dictionary containing the article’s information along with the full text of the article. search_pmc searches PubMed Central for articles based on a query and returns a list of PMC IDs for the search results.
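The two main functions compose naturally: search for PMC IDs, then fetch the article for each ID. The following is a minimal sketch of that workflow, not part of the package. `collect_articles` is a hypothetical helper name, and the search and fetch functions are passed in as arguments so the example runs without network access; in real use you would pass search_pmc and get_article_info.

```python
def collect_articles(query, search_fn, fetch_fn, max_results=5):
    """Search for PMC IDs, then fetch one parsed article per ID.

    search_fn is expected to behave like search_pmc (returns a list of
    PMC IDs) and fetch_fn like get_article_info (returns a dict).
    Both are injected so the sketch does not depend on network access.
    """
    pmc_ids = search_fn(query, retmax=max_results)
    articles = {}
    for pmc_id in pmc_ids:
        # Map each PMC ID to whatever the fetch function returns for it.
        articles[pmc_id] = fetch_fn(pmc_id)
    return articles


# Intended use with the real package (requires network access):
#   from sciencescraper.pmc import get_article_info, search_pmc
#   articles = collect_articles("CRISPR", search_pmc, get_article_info)
```

The keys of each returned dictionary are not listed on this page, so the sketch stores the dictionaries as-is rather than reading specific fields.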
Package Contents
Functions
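get_article_info(pmc_id, chunk_size=None)
: Fetches and parses an article from PMC given a PMC ID
get_full_text(pmc_id, chunk_size=None)
: Fetches the full text of an article from PMC given a PMC ID
search_pmc(query, sort='relevance', mindate=None, maxdate=None, reldate=None, retstart=0, retmax=20)
: Searches PMC for articles given a query
check_new_articles(query, days, chunk_size=None)
: Get open access articles from PubMed Central that have been published after a specified date.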
- sciencescraper.pmc.get_article_info(pmc_id, chunk_size=None)
Fetches and parses an article from PMC given a PMC ID
- Parameters:
pmc_id (str) – The PMC ID of the article
chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.
- Returns:
article – The parsed article
- Return type:
dict
- sciencescraper.pmc.get_full_text(pmc_id, chunk_size=None)
Fetches the full text of an article from PMC given a PMC ID
- Parameters:
pmc_id (str) – The PMC ID of the article
chunk_size (int, optional) – The size of the chunks to split the full text into. Default is None.
- Returns:
full_text – The full text of the article
- Return type:
str
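This page does not say what unit chunk_size is measured in or how chunk boundaries are chosen. Purely as an illustration, and assuming chunks are fixed-length character spans, splitting could look like the sketch below. `split_into_chunks` is a hypothetical helper, not a function of the package, and the library's actual chunking may differ (for example, it might split on words or sentences).

```python
def split_into_chunks(text, chunk_size=None):
    """Split text into consecutive pieces of at most chunk_size characters.

    Assumption: chunk_size counts characters. chunk_size=None is treated
    here as "do not split", so the text comes back as a single piece.
    """
    if chunk_size is None:
        return [text]
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```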
- sciencescraper.pmc.search_pmc(query, sort='relevance', mindate=None, maxdate=None, reldate=None, retstart=0, retmax=20)
Searches PMC for articles given a query
- Parameters:
query (str) – The query to search for
sort (str, optional) – The sorting order for the search results. Options are:
  - “relevance”: Sort by relevance
  - “pub_date”: Sort by publication date in descending order
  - “JournalName”: Sort by journal in ascending order
  - “Author”: Sort by first author in ascending order
mindate (str, optional) – The minimum date for the search results. Format is “YYYY/MM/DD”, “YYYY/MM”, or “YYYY”. Must be provided together with maxdate.
maxdate (str, optional) – The maximum date for the search results. Format is “YYYY/MM/DD”, “YYYY/MM”, or “YYYY”. Must be provided together with mindate.
reldate (str, optional) – The number of days to search back from the current date.
retstart (int, optional) – The index of the first article to return
retmax (int, optional) – The maximum number of articles to return
- Returns:
pmc_ids – The PMC IDs of the search results
- Return type:
list
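The parameters of search_pmc share their names with those of the NCBI E-utilities ESearch endpoint. Assuming search_pmc forwards them more or less directly (this page does not confirm how the request is built), the sketch below shows how such a request URL could be assembled, including the rule that mindate and maxdate must be given together. `build_esearch_url` is a hypothetical helper, not part of the package, and it only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, sort="relevance", mindate=None, maxdate=None,
                      reldate=None, retstart=0, retmax=20):
    """Build an ESearch URL for the PMC database from search_pmc-style arguments."""
    # Enforce the documented pairing: both dates or neither.
    if (mindate is None) != (maxdate is None):
        raise ValueError("mindate and maxdate must be provided together")
    params = {
        "db": "pmc",
        "term": query,
        "sort": sort,
        "retstart": retstart,
        "retmax": retmax,
    }
    if mindate is not None:
        params["mindate"] = mindate
        params["maxdate"] = maxdate
    if reldate is not None:
        params["reldate"] = reldate
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

Paging through a large result set then amounts to repeating the call with retstart increased by retmax each time.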
- sciencescraper.pmc.check_new_articles(query, days, chunk_size=None)
Gets open access articles from PubMed Central that match a query and were published within the last `days` days.
- Parameters:
query (str) – The query to search for
days (int) – The number of days to search back from the current date.
chunk_size (int, optional) – The size of the chunks to split the full text into
- Returns:
pmc_articles – A list of dictionaries containing article information
- Return type:
list of dict
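The days argument defines a look-back window relative to the current date. This page does not state whether the function passes it on as reldate or converts it to explicit dates, so the following is only a sketch of the second option: turning a number of days into a (mindate, maxdate) pair in the “YYYY/MM/DD” format that search_pmc accepts. `date_window` is a hypothetical helper, not part of the package.

```python
from datetime import date, timedelta


def date_window(days, today=None):
    """Return (mindate, maxdate) strings covering the last `days` days.

    `today` can be supplied to make the result reproducible; it defaults
    to the current local date.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    end = today or date.today()
    start = end - timedelta(days=days)
    # Format both ends as YYYY/MM/DD, one of the formats search_pmc documents.
    return start.strftime("%Y/%m/%d"), end.strftime("%Y/%m/%d")
```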