
Open Source RedPajama-Data-v2 with 30 Trillion Tokens is Here

RedPajama has unveiled the latest version of its dataset, RedPajama-Data-v2, a large repository of web data aimed at advancing language model training. The dataset contains 30 trillion tokens, filtered and deduplicated from a raw pool of over 100 trillion tokens drawn from 84 CommonCrawl data dumps. It covers five languages: English, French, Spanish, German, and Italian.

Click here to check out the GitHub repository.

RedPajama-Data-v2 also adds more than 40 pre-computed data quality annotations, which can be used to further filter and weight the data.

Together AI (@togethercompute) posted on October 30, 2023: "The dataset covers 5 languages, with 40+ pre-computed data quality annotations that can be used for further filtering and weighting. Here is one example of how to filter RedPajama-Data-v2 in a similar way as Gopher: pic.twitter.com/VqKObX9Iqr"
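As a rough illustration of what rule-based filtering on pre-computed annotations can look like, here is a minimal Python sketch. The annotation names, the thresholds, and the per-document dictionary layout are all assumptions made for this example. They are loosely modeled on the kind of heuristics described for Gopher and are not the dataset's actual schema or Together AI's code.

```python
from typing import Mapping

# Hypothetical annotation names mapped to (min, max) bounds.
# Both the names and the thresholds are illustrative assumptions.
GOPHER_LIKE_RULES = {
    "word_count": (50, 100_000),
    "mean_word_length": (3.0, 10.0),
    "symbol_to_word_ratio": (0.0, 0.1),
    "frac_lines_end_with_ellipsis": (0.0, 0.3),
}


def passes_gopher_like_filter(signals: Mapping[str, float]) -> bool:
    """Return True only if every rule's annotation is present and in range."""
    for name, (low, high) in GOPHER_LIKE_RULES.items():
        value = signals.get(name)
        if value is None or not (low <= value <= high):
            return False
    return True


# Two made-up documents carrying pre-computed annotations.
docs = [
    {"id": "a", "signals": {"word_count": 420, "mean_word_length": 4.7,
                            "symbol_to_word_ratio": 0.01,
                            "frac_lines_end_with_ellipsis": 0.0}},
    {"id": "b", "signals": {"word_count": 12, "mean_word_length": 4.1,
                            "symbol_to_word_ratio": 0.0,
                            "frac_lines_end_with_ellipsis": 0.0}},
]

kept = [d["id"] for d in docs if passes_gopher_like_filter(d["signals"])]
```

Because the annotations are computed ahead of time, a filter like this only compares numbers against thresholds, so different threshold choices can be tried without reprocessing the raw text.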

Over the past six months, RedPajama's previous release, RedPajama-1T, has had a significant impact on the language model community. The 5TB dataset of high-quality English tokens has been downloaded more than 190,000 times and has been put to use in creative ways.

RedPajama-1T served as a stepping stone towards the goal of creating open datasets for language model training, but RedPajama-Data-v2 takes this ambition to new heights with its mammoth 30 trillion token web dataset.

According to Together AI, RedPajama-Data-v2 is the largest public dataset released specifically for LLM training. Most notably, it introduces 40+ pre-computed quality annotations that the community can use to filter and weight the data for their own purposes. The release covers over 100 billion text documents derived from 84 CommonCrawl data dumps, for a total of 100+ trillion raw tokens.

Together AI says that the dataset offers a solid foundation for advancing state-of-the-art open LLMs such as Llama, Mistral, Falcon, MPT, and the RedPajama models.

RedPajama-Data-v2 focuses primarily on CommonCrawl data, while other sources such as Wikipedia remain available in RedPajama-Data-v1. To enrich their data mixture further, users are encouraged to add The Stack (by BigCode) for code and s2orc (by AI2) for scientific articles. RedPajama-Data-v2 is built from publicly available web data and has three core components: plain-text source data, 40+ quality annotations, and deduplication clusters.
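To show how deduplication clusters can be used downstream, here is a minimal Python sketch that keeps one document per cluster. The `(doc_id, cluster_id)` pair format and the function name are hypothetical, chosen only for illustration. The release may represent its clusters differently.

```python
from typing import Iterable, Iterator, Tuple


def keep_one_per_cluster(docs: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield the first document ID seen for each cluster ID.

    `docs` is an iterable of (doc_id, cluster_id) pairs; this layout is an
    assumption for the example, not the dataset's real format.
    """
    seen_clusters = set()
    for doc_id, cluster_id in docs:
        if cluster_id in seen_clusters:
            continue  # a document from this cluster was already kept
        seen_clusters.add(cluster_id)
        yield doc_id


# Made-up sample: doc-1 and doc-2 belong to the same duplicate cluster.
sample = [("doc-1", "c1"), ("doc-2", "c1"), ("doc-3", "c2")]
unique_ids = list(keep_one_per_cluster(sample))
```

Keeping the first document seen in each cluster is the simplest policy. A real pipeline might instead keep the copy with the best quality annotations.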

Creating the source data begins with each CommonCrawl snapshot passing through the CCNet pipeline, chosen because its light processing keeps the data close to its raw form. This yields 100 billion individual text documents, in line with the overarching principle of preserving the raw data.

The post Open Source RedPajama-Data-v2 with 30 Trillion Tokens is Here appeared first on Analytics India Magazine.
