Data
This project can download evaluation datasets, thumbnails and YouTube videos, and it can scrape images from Google.
Datasets
- src.data.datasets.download_imdb_faces_dataset(path: str = 'data/datasets/imdb-faces')[source]
Downloads the IMDb-Faces dataset and parses an information.csv. Details about the dataset can be found here: https://github.com/fwang91/IMDb-Face.
Warning: Many links are outdated, so only half of the dataset can still be downloaded.
- Parameters
path (str) – Path where the videos and information.csv should be saved.
- src.data.datasets.download_imdb_wiki_dataset(path: str = 'data/datasets/imdb-wiki')[source]
Downloads the IMDb-Wiki dataset and parses an information.csv. Details about the dataset can be found here: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/.
- Parameters
path (str) – Path where the videos and information.csv should be saved.
- src.data.datasets.download_seqamlab_dataset(path: str = 'data/datasets/ytcelebrity')[source]
Downloads the YouTube Celebrities Face Tracking and Recognition Dataset and parses an information.csv. Details about the dataset can be found here: http://seqamlab.com/youtube-celebrities-face-tracking-and-recognition-dataset/.
- Parameters
path (str) – Path where the videos and information.csv should be saved.
- src.data.datasets.download_youtube_faces_db(path: str = 'data/datasets/youtube-faces-db', download: bool = False)[source]
Downloads the YouTube Faces Database and parses an information.csv. Details about the dataset can be found here: https://www.cs.tau.ac.il/~wolf/ytfaces/.
- Parameters
path (str) – Path where the videos are located and where the information.csv should be saved.
download (bool) – Whether the dataset should be downloaded automatically or only parsed. The download can take a long time.
Thumbnails
- src.data.knowledge_graphs.download_dbpedia_thumbnails(path: str = 'data/thumbnails/dbpedia_thumbnails', query_links: bool = True, download: bool = True)[source]
Queries the thumbnail links from DBpedia and saves them in the file path/Thumbnails_links.csv.
Downloads the DBpedia thumbnails and arranges them in the following structure:
<Entity1>
    <Thumbnail1>
<Entity2>
    <Thumbnail1>
Saves a summary of the results in path/download_results.csv.
Saves the images in path/thumbnails.
- Parameters
path (str) – Path where the thumbnails should be saved.
query_links (bool) – Whether to query the thumbnail links.
download (bool) – Whether to download the thumbnails.
- src.data.knowledge_graphs.download_entity_list(path: str = 'data/thumbnails', entity_list: Optional[list] = None)[source]
Downloads a specific list of entity thumbnails from wikidata.
- Parameters
path (str) – Path where the thumbnails are stored.
entity_list (list) – List of entities whose thumbnails should be downloaded.
- Returns
A list of the entities that are still missing.
- Return type
sm (list)
- src.data.knowledge_graphs.download_images(path, method='wikidata')[source]
Downloads entity thumbnails.
- Parameters
path (str) – Path where the thumbnails should be stored and the Thumbnails_links.csv is located.
method (str) – Source knowledge graph.
- src.data.knowledge_graphs.download_missing_thumbnails(path: str = './videos/ytcelebrity', path_thumbnails: str = 'data/thumbnails')[source]
Compares a list of entities with the ones in a dataset and downloads missing ones.
- Parameters
path (str) – Path where the information.csv of the dataset is saved.
path_thumbnails (str) – Path where the Thumbnails_links.csv is saved.
- Returns
List of missing entities that have been found.
- Return type
missing_entities (list)
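The comparison step can be pictured as a set difference between the entities named in a dataset and the entities that already have thumbnails. The following is a minimal sketch under that assumption, not the project's implementation: `find_missing_entities` and the one-folder-per-entity layout are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_entities(dataset_entities: Iterable[str], thumbnails_dir: str) -> List[str]:
    """Return the dataset entities that have no thumbnail folder yet (hypothetical helper)."""
    root = Path(thumbnails_dir)
    # Assumption: every entity with a thumbnail has its own sub-folder named after it.
    existing = {p.name for p in root.iterdir() if p.is_dir()} if root.is_dir() else set()

    missing: List[str] = []
    seen = set()
    for entity in dataset_entities:
        # Keep the dataset order and report each missing entity only once.
        if entity not in existing and entity not in seen:
            missing.append(entity)
            seen.add(entity)
    return missing
```

The returned list could then be handed to a downloader such as download_entity_list, which accepts an entity list.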
- src.data.knowledge_graphs.download_thumbnail(index: int, i_thumbnail_url: str, i_path: str, i_file_name: str)[source]
Downloads a single thumbnail.
- Parameters
index (int) – The index of the downloaded thumbnail taken from the thumbnail urls dataframe
i_thumbnail_url (str) – The url of the downloaded thumbnail
i_path (str) – The download path
i_file_name (str) – The file name
- Returns
A list containing the index, the thumbnail url and the result outcome (success, HTTPError or UnicodeEncodeError)
- Return type
output (list)
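To make the documented return value concrete, here is a rough sketch of a function that reports one of the three outcomes named above. It is an illustration only: the name `download_thumbnail_sketch` and the injectable `fetch` argument are assumptions added for testability, and the project's own function may be written differently.

```python
import os
import urllib.request
from typing import Callable, List
from urllib.error import HTTPError


def download_thumbnail_sketch(
    index: int,
    thumbnail_url: str,
    path: str,
    file_name: str,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> List:
    """Download one thumbnail and return [index, url, outcome]."""
    os.makedirs(path, exist_ok=True)
    target = os.path.join(path, file_name)
    try:
        # fetch(url, target) stores the remote file at `target`.
        fetch(thumbnail_url, target)
        outcome = "success"
    except HTTPError:
        outcome = "HTTPError"
    except UnicodeEncodeError:
        outcome = "UnicodeEncodeError"
    return [index, thumbnail_url, outcome]
```

Returning the outcome instead of raising lets a caller collect one row per thumbnail, for example when writing a summary such as download_results.csv.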
- src.data.knowledge_graphs.download_wikidata_thumbnails(path: str = 'data/thumbnails/wikidata_thumbnails', query_links: bool = True, download: bool = True)[source]
Queries the thumbnail links from Wikidata and saves them in the file path/Thumbnails_links.csv.
Downloads the Wikidata thumbnails and arranges them in the following structure:
<Entity1>
    <Thumbnail1>
<Entity2>
    <Thumbnail1>
Saves a summary of the results in path/download_results.csv.
Saves the images in path/thumbnails.
- Parameters
path (str) – Path where the thumbnails should be saved.
query_links (bool) – Whether to query the thumbnail links.
download (bool) – Whether to download the thumbnails.
- src.data.knowledge_graphs.get_same_as_link(uri: str) → str[source]
Gets the corresponding Wikidata/DBpedia URI for a DBpedia/Wikidata URI.
- Parameters
uri (str) – A DBpedia- or Wikidata-URI.
- Returns
The URI of the entity in the other knowledge graph.
- Return type
corresponding_uri (str)
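One plausible way to perform such a lookup is an owl:sameAs query in SPARQL. The sketch below only builds the query text. It assumes that the sameAs links are stored on the DBpedia side, so both directions would be sent to a DBpedia SPARQL endpoint. The helper `build_same_as_query` and both prefix constants are hypothetical and not taken from the project.

```python
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"
WIKIDATA_PREFIX = "http://www.wikidata.org/entity/"


def build_same_as_query(uri: str) -> str:
    """Build a SPARQL query that finds the owl:sameAs counterpart of `uri`."""
    if uri.startswith(DBPEDIA_PREFIX):
        # DBpedia -> Wikidata: follow owl:sameAs away from the DBpedia resource.
        pattern = f"<{uri}> owl:sameAs ?other ."
        target_prefix = WIKIDATA_PREFIX
    elif uri.startswith(WIKIDATA_PREFIX):
        # Wikidata -> DBpedia: find the DBpedia resource that points at the Wikidata entity.
        pattern = f"?other owl:sameAs <{uri}> ."
        target_prefix = DBPEDIA_PREFIX
    else:
        raise ValueError(f"Neither a DBpedia nor a Wikidata URI: {uri}")

    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?other WHERE {\n"
        f"  {pattern}\n"
        # Keep only results that live in the other knowledge graph.
        f'  FILTER(STRSTARTS(STR(?other), "{target_prefix}"))\n'
        "}"
    )
```

Running the query and reading the first binding of `?other` would then give the corresponding URI, if one exists.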
- src.data.knowledge_graphs.get_uri_from_csv(name: str, data: pandas.core.frame.DataFrame)[source]
Gets the DBpedia and Wikidata URIs of an entity from a Thumbnail_links.csv that has been loaded as a dataframe.
- Parameters
name (str) – Name of the entity.
data (DataFrame) – Dataframe of the Thumbnail_links.csv
- Returns
dbpedia_uri (str) – The URI of the entity in DBpedia.
wikidata_uri (str) – The URI of the entity in Wikidata.
Scraping from Google Images
- src.data.enrich_with_photos.compare_install_face(img, img_dir, downloaded, encode=None)[source]
Downloads an image only if the detected face is sufficiently similar to the faces in the other images.
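A similarity check of this kind is often done by comparing face embeddings. The sketch below shows one possible criterion, the mean cosine similarity against the embeddings accepted so far. The function names, the use of cosine similarity and the threshold of 0.8 are all assumptions for illustration; the documentation does not state which measure or threshold the project uses.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_similar_enough(
    candidate: Sequence[float],
    accepted: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical value
) -> bool:
    """Accept a candidate embedding if it resembles the already accepted ones."""
    if not accepted:
        # Nothing to compare against yet, so the first image is accepted.
        return True
    mean = sum(cosine_similarity(candidate, ref) for ref in accepted) / len(accepted)
    return mean >= threshold
```

Comparing against the mean rather than a single reference makes the decision less sensitive to one unusual image among those already downloaded.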
- src.data.enrich_with_photos.download_images(path, main_keyword, supplemented_keywords, download_dir, num_images, encode=None)[source]
Downloads images for one main keyword combined with multiple supplemented keywords.
- Parameters
main_keyword (str) – Main keyword.
supplemented_keywords (list[str]) – List of supplemented keywords.
- Returns
None
- src.data.enrich_with_photos.download_page(url: str)[source]
Downloads the raw content of the page.
- Parameters
url (str) – URL of the page
- Returns
Raw content of the page
- Return type
content (str)
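Fetching the raw content of a page can be done with the standard library alone. The following is a minimal sketch, not the project's code: the name `download_page_sketch`, the User-Agent value, the timeout and the UTF-8 decoding are assumptions.

```python
import urllib.request


def download_page_sketch(url: str, user_agent: str = "Mozilla/5.0", timeout: float = 10.0) -> str:
    """Return the body of `url` as text."""
    # Some sites reject requests without a browser-like User-Agent header.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced, not raised.
        return response.read().decode("utf-8", errors="replace")
```

Network errors such as urllib.error.URLError are left to the caller in this sketch.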
- src.data.enrich_with_photos.download_thumbnails_entity_list(download_dir, entity_list, num_images, enrich=True)[source]
- src.data.enrich_with_photos.encode_downloaded_img(img)[source]
Creates the embedding for an image.
- Parameters
img – The image to encode
- Returns
embedding
YouTube
- src.data.youtube.download_youtube_video(url: str, path: str = 'data/datasets/youtube') → str[source]
Downloads a single video from YouTube.
- Parameters
url (str) – YouTube-link of the video to download.
path (str) – Path where the videos should be saved.
- Returns
Path to the downloaded file.
- Return type
path (str)
- src.data.youtube.download_youtube_videos(txt_path: Optional[str] = None, path: str = 'data/datasets/youtube') → list[source]
Downloads videos from YouTube.
- Parameters
txt_path (str) – Path to a text file that contains one YouTube video URL per line.
path (str) – Path where the videos should be saved.
- Returns
List of paths to the downloaded videos.
- Return type
video_paths (list)
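The batch download can be thought of as reading the URL file line by line and calling a single-video downloader for each entry. The sketch below illustrates that flow under stated assumptions: `read_video_urls` and `download_all` are hypothetical names, blank lines are assumed to be skipped, and `download_one` stands in for a function with the same shape as download_youtube_video(url, path).

```python
from typing import Callable, List


def read_video_urls(txt_path: str) -> List[str]:
    """Read one URL per line, skipping blank lines."""
    with open(txt_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def download_all(txt_path: str, path: str, download_one: Callable[[str, str], str]) -> List[str]:
    """Call download_one(url, path) for every listed URL and collect the returned file paths."""
    return [download_one(url, path) for url in read_video_urls(txt_path)]
```

Passing the downloader in as an argument keeps the sketch independent of any particular YouTube library.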